Under The Hood Of The Batch Engine: SIMD With SQL Server 2016 CTP 2

Most CPUs as we know them are based on the “Von Neumann architecture”:

Van Neumann

The Von Neumann architecture is based around the concept of the fetch execute cycle, instructions and data are fetched from memory, a control unit decodes and schedules instructions which are executed by an arithmetic logic unit. In modern Intel processors, the fetching, decoding and scheduling of instructions are executed in what is known as the “Front end” and the “Back end” of the processor contains the arithmetic logic units:

FrontBackEnd

The CPU registers are the smallest and fastest part of the memory hierarchy on the CPU die and are 64 bytes in width. Since the Pentium 5 circa 1997, Intel has developed special instructions and registers which allow multiple data items to be processed simultaneously by single instructions, the generic term for this is Single Instruction Multiple Data.

SIMD Where We Have Come From

The following image is taken from the Intel White Paper: “The Significance of SIMD, SSE and AVX”:

SIMD evolution

SIMD Where We Are Going: The Future AVX

Intel road map

Why Is AVX Important ?

Simply put, the wider the register, the more data items we can process per CPU clock cycle and 256/512 bit AVX registers give us more bang for our buck than 128 bit SSE 2 registers.

What Flavor of SIMD Is SQL Server 2016 CTP 2 Using ?

Evidence of the use of SIMD in SQL Server 2016 CTP 2 can be found in the
de-compression element of a column store scan:

SIMD in stack

The _m128i on the last line highlighted is documented here:

m128i

This indicates that the batch engine is using streaming SIMD extensions 2 instructions. The query this stack trace is from involved purely integer data, in order to see if AVX instructions can be used I need to switch to my other test server which has Sandybridge based Xeon processors and use floating point data. Here is how I created this test data:

WITH generator AS (
    SELECT     TOP 1000000 id = ROW_NUMBER() 
                   OVER (ORDER BY a)
    FROM       (SELECT a = 1
                FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2)
SELECT      a.DateKey AS OrderDateKey
           ,CAST(((id - 1) % 1048576) AS money) AS BigColumn1
           ,CAST(((id - 1) % 1048576) AS money) AS BigColumn2
           ,CAST(((id - 1) % 1048576) AS money) AS BigColumn3
INTO       BigFloatDataColumnStore
FROM       generator
CROSS JOIN dbo.DimDate a;

The significance of using floating point data is because the incarnation of AVX in Sandybridge based Xeons only supports wide registers for floating point data, these could be used to store integer data. However, if Microsoft have enabled the use of AVX instructions, it is more likely that they have done this for floating point operations in the first instance.

sandybridge

From what I can see the SQL Server development team have coded to a SIMD register and instruction set that harks back to 2000.

Addendum

AVX comes of age with the Haswell process which supports AVX2, this has integer registers and a much richer instruction set, without either Microsoft coming out in public and announcing that they are leveraging AVX2 or having access to a machine with Haswell processors on which to test this, there is no way of knowing (until such time) that Microsoft are using this.

Test Drive: SQL Server 2014 SP1 Versus SQL Server 2016 CTP2

  • Dell Precision T5500test hardware
  • Windows server 2012 r2
  • 2 x 6 core 2.4 Ghz Xeons (Westmere)
  • 32GB triple channel 1600Mhz DDR3 memory
  • 1 x Fusion IO ioDrive 2 Duo 2.3TB
  • 2 x SanDisk Extreme Pro 480GB SSD
  • Trace flags 1117, 1118, 8008, 2330 and -E
  • Lock pages in memory privilege used

This is how the test data is created:

WITH generator AS (
    SELECT     TOP 1000000 id = ROW_NUMBER() 
                   OVER (ORDER BY a)
    FROM       (SELECT a = 1
                FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2)
SELECT      a.DateKey AS OrderDateKey
           ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn1
           ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn2
           ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn3
INTO       BigDataColumnStore
FROM       generator
CROSS JOIN dbo.DimDate a;

and this is the test query:

SELECT    CalendarQuarter
         ,SUM(BigColumn1)
         ,SUM(BigColumn2)
         ,SUM(BigColumn3)
FROM     [dbo].[BigDataColumnStore] b
JOIN     [dbo].[DimDate] d
ON       b.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

I have found that enabling hyper-threading and disabling adjacent cache line resulted in the lowest elapsed time for the query above, lets give this a try for both version of the database engine, first with a warm then cold column store object pool:

warm cs obj pool

With a warm column store object pool the SQL Server 2016 CTP 2 batch engine out performs its 2014 counter part, unfortunately this is not true when the test query is run using a cold column store object pool. I also collected CPU time via SET STATISTICS TIME ON:

cold cs obj pool CPU time

There are two different ways of interpreting CPU utilization, something is burning up more CPU cycles because:

  • It is performing some operation in an inefficient manner.
  • It is designed and coded to be scale-able and is burning more CPU cycles because it is subject to minimal locking and or latching.

These are the most CPU intensive modules captured in an ETW trace of the test query with a DOP of 24 on SQL Server 2014 SP1:

SQL 2014 most CPU intensive modules

sqlmin.dll encapsulates the database engine runtime and storage engine, lets drill down into where the time is going in this module:

SQL 2014 funcs

Now lets do the same for SQL Server 2016 CTP 2:

SQL 2016 most CPU intensive modules

as before here are the most CPU intensive functions:

SQL 2016 funcs

Note the line for the function RdpBitUnpack::BitUnpackSIMD, my suspicion is that this replaces what CpbBatch::ConvertByteArrayToSelectionVector does in SQL Server 2014, these two call stack excerpts support this hunch:

SQL Server 2014 CTP 1

ByteArrayToSelectionVector

SQL Server 2016 CTP 2

SIMD unpack

The fact that 15,599 sampled calls to the SIMD function are captured compared to 14,162 calls to CpbBatch::ConvertByteArrayToSelectionVector suggests there might be a bug in SQL Server 2016. Put another way, if we are unpacking a vector (batch) using 128 bit wide registers why are we doing this 15,599 (sampled) times when doing this with a 64 bit register incurs 14,162 sampled calls to CpbBatch::ConvertByteArrayToSelectionVector ?.

I was going to dig deep into why the SQL Server 2014 SP1 and SQL Server 2016 CTP 2 disparity exists, however there is probably more value in revisiting this whole exercise once CTP 3 or the RTM version of SQL Server 2016 is out.

Takeaways, What Have We Learned ?

  • SQL Server 2016 CTP 2 is using SIMD for the decompression of batches.
  • SQL Server 2016 CTP 2 is using the old 128 bit SIMD registers and instructions, this remains the case irrespective of whether the processor supports AVX and the test data contains floating point numbers.
  • I will caveat this final point by emphasizing the fact that the current release of SQL Server 2016 is not RTM, however performance is disappointing and this may be down to a potential bug.

One thought on “Under The Hood Of The Batch Engine: SIMD With SQL Server 2016 CTP 2

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s