Super Scaling The SQL Server 2014 Parallel Insert: Part 1, Investigating Scan Performance

One of the unheralded features of SQL Server 2014 is the parallel insert; provided through the SELECT INTO statement. In this blog post series I’d like to look at pushing this feature to its limits. This is my test server, which differs slightly from I’ve used in the past:

2 x 6 core Xeon E5645 ( Westmere ) 2.4Mhz
12 Gb memory
iodrive 2 Duo 2.4Tb
Windows Server 2012 R2
SQL Server 2014 CU 2

The following query is used to produce the test data:

;WITH generator AS (
     SELECT     TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
     FROM       (SELECT a = 1
                 FROM     master.dbo.syscolumns) c1
     CROSS JOIN master.dbo.syscolumns c2
SELECT % 24                AS Col1
       , + 1                 AS Col2
       , + 2                 AS Col3
       , + 3                 AS Col4
       , + 4                 AS Col5
       ,CASE % 2
            WHEN 1
                THEN AddressLine1
                ELSE AddressLine2
        END                       AS Col6
       ,CASE % 7
            WHEN 1
                THEN AddressLine1
                ELSE AddressLine2
        END                       AS Col7
       ,FirstName                 AS Col8
       ,MiddleName                AS Col9
       ,LastName                  AS Col10
       ,CustomerAlternateKey      AS Col11
INTO   BigData
FROM   generator g1
CROSS JOIN DimCustomer

Col1 is populated using modulo 24, this is by design and not by accident and will come into play later 😉

In this post I will look at how fast the source table for the SELECT INTO can be scanned. Lets start with the following query:

FROM   [dbo].[BigData] WITH (NOLOCK)

This executes in 1 minute and 10 seconds with a maximum CPU utilization of 82% and a top IO throughput of 1900 Mb/s, The following are the waits statistics of this query:


Performance is throttled back by latching, this is what the latch activity looks like:


ACCESS_METHODS_DATASET_PARENT is a latch taken out when a worker thread asks the parallel table scan enumerator to for a range of pages to read. Someone sitting at a poker table asking the card dealer to deal them a hand of cards is a very good analogy for this.

latchingACCESS_METHODS_DATASET_PARENT latch activity can be reduced by partitioning the table. As mentioned, the part of the database engine that divvies up page ranges to child threads acts like a card dealer, the more partitions we have the more card dealers we have. With a heap and only one “Card dealer” something has to serialize access to the card dealer, hence this latch. I have 24 logical processors available to SQL Server, therefore 24 partitions, one “Card dealer” per logical processor seemed like a sensible number to go with. Here is the partition function:

CREATE PARTITION FUNCTION [HashPartitionFunc24](bigint) 
FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)

and here is the partition scheme:

AS PARTITION [HashPartitionFunc24]

Finally, this is the create statement for my hash partitioned version of the BigData table:

CREATE TABLE [dbo].[BigDataHeap24](
     [Col1]  [bigint]        NULL
    ,[Col2]  [bigint]        NULL
    ,[Col3]  [bigint]        NULL
    ,[Col4]  [bigint]        NULL
    ,[Col5]  [bigint]        NULL
    ,[Col6]  [nvarchar](120) NULL
    ,[Col7]  [nvarchar](120) NULL
    ,[Col8]  [nvarchar](50)  NULL
    ,[Col9]  [nvarchar](50)  NULL
    ,[Col10] [nvarchar](50)  NULL
    ,[Col11] [nvarchar](15)  NULL
) ON HashPartitionScheme24(Col1)


If we repeat the simple SELECT COUNT statement on this table we get an IO throughput of around 2060Mb/s and a peek CPU usage of 100%, but the elapsed time is only 1 minute and 5 seconds. However, where as the waits on LATCH_EX accounted for the best part of 92% of the total waits before, it now accounts for less than 1% of the total wait activity. The IO bandwidth of the hardware is now the top bottleneck:

No latching waits

Wait activity by its very nature never reflects service time for which we have 100% CPU utilization, what is causing this ?, the stack trace might provide some clues:


SQL Server accounts for 60.97% of the total weight of what xperf sampled, whilst the query was running, the call to:


accounts for roughly half the CPU time expended according to the stack trace. Why is SQL Server interested in column values when performing a count ?, the answer is that it needs to check for NULLs, because any count on any column always excludes NULLs. As a pure academic exercise lets repeat the query against the hash partitioned table, but this time the table will be re-created with NOT NULL constraints on each column. For the purposes of expedience, the following statement is used to re-create the test data minus any NULL values:

INSERT INTO [dbo].[BigDataHeap24] WITH (TABLOCK)
SELECT  [Col1]
       ,ISNULL([Col6] , 'X')
       ,ISNULL([Col7] , 'X')
       ,ISNULL([Col8] , 'X')
       ,ISNULL([Col9] , 'X')
       ,ISNULL([Col10], 'X')
       ,ISNULL([Col11], 'X')
FROM   [dbo].[BigData] WITH (NOLOCK)

Elapsed time for the query drops down to a 55 seconds, peak CPU utilization drops to 31% and we nearly touch 3000 Mb/s in peak IO throughput.

taskmgrThe Fusion IO card is being pushed near to its limit of 3Gb/s for sequential reads, but this is only managing to keep a third of the total CPU capacity of the server busy. This ties in with something I mentioned during my blog post series on the batch engine; the use of SIMD; machine instructions which allow the CPU to perform the same operation on multiple data points simultaneously:


Joe Chang has blogged about how this could help optimize table scans here, I will borrow this example from Joe’s post to illustrate how this might work ( everything in italics is based on Joe’s work ):

Assuming that the appropriate lock and latches have been taken out, the page header read and that the column offsets, sizes and data types have been read from the system table and that the columns being accessed are for fixed length not null data types, an extremely simplified instruction sequence for accessing the column data is:

1) Load (2 byte) offset for first row at end of page
2) Load column offset
3) Compute column address (page address + row offset + column offset)
4) Load column from memory location
5) Consume column (plus data validity tests), write results

For the purposes of brevity and simplicity, I will assume that performing each of the above steps takes a single clock cycle and that we are not taking into account any CPU cycles required to load data from anywhere in the CPU cache hierarchy and there are no memory stalls.

This sequence is repeated for each column and each row. Assuming that the SIMD registers are 128 bits wide and that row offsets, column offsets and column data type sizes are all 2 bytes wide and 3 of these special purpose registers exist, this information can be stored as follows:

XMM1:             Row offsets:        1, 2, 3, 4, 5, 6, 7, 8
XMM2:             Column offsets:   1, 2, 3, 4, 5, 6, 7, 8
XMM3:             Column sizes:     1, 2, 3, 4, 5, 6, 7, 8

If we are interested in accessing eight columns, the approach that does not use SIMD takes 40 clock cycles, where as the SIMD approach only takes 5, this is a reduction in the number of CPU clock cycles used by an order of magnitude.

Switching focus to the hardware, I wanted to increase the IO bandwidth I had, in addition to the ioDrive2 duo card I had two Sandisk Extreme Pro 480Gb solid state drives rated at 550 Mb/s for sequential reads. To utilise these two different types of flash storage I created a file group with one data file on the San Disk SSDs for every six data files on the ioDrive2, this excerpt illustrates how I did this, the ‘G; drive is the ioDrive and the C drive a Sandisk SSD:

ALTER DATABASE [AdventureWorksDW2014]
 ( NAME       = N'FIO_01'
  ,FILENAME   = N'G:\SQLDATA\FIO_01.mdf'
  ,SIZE       = 20971520KB
,( NAME       = N'FIO_02'
  ,FILENAME   = N'G:\SQLDATA\FIO_02.mdf'
  ,SIZE       = 20971520KB
,( NAME       = N'FIO_03'
  ,FILENAME   = N'G:\SQLDATA\FIO_03.mdf'
  ,SIZE       = 20971520KB
,( NAME       = N'FIO_04
  ,FILENAME   = N'G:\SQLDATA\FIO_04.mdf'
  ,SIZE       = 20971520KB
,( NAME       = N'FIO_05'
  ,FILENAME   = N'G:\SQLDATA\FIO_05.mdf'
  ,SIZE       = 20971520KB
,( NAME       = N'FIO_06'
  ,FILENAME   = N'G:\SQLDATA\FIO_06.mdf'
  ,SIZE       = 20971520KB
,( NAME       = N'FIO_07'
  ,FILENAME   = N'C:\SQLDATA\FIO_01.mdf'
  ,SIZE       = 20971520KB

With the additional two solid state disk incorporated into the FIO file group, I now get a peak IO throughput of 3426Mb/s, a sustained IO throughput at around 3300Mb/s, a peak CPU consumption of 40% and an elapsed execution time of 46 seconds.

I’d like to take a step back and consider what sort of IO bandwidths are required to avoid CPU core starvation in modern CPUs. The original fast track methodology was based on the Intel Core 2 architecture, a Core 2 CPU can consume 200 Mb/s, my Westmere based CPU cores can consume 347 Mb/s. The next generation micro architecture on from Nehalem/Westmere; Sandybridge includes an integrated IO controller, also referred to as the “IO hub”, this increases the core consumption rate to somewhere in the range of an additional 10~15% on top of Westmere ( to around 390 Mb/s ). With my simple ‘Academic’ SELECT COUNT test, I am only seeing 40% CPU utilization with a sustained IO throughput of 40% because I’m not looking at the values inside the columns of the BigData table.

Here is what the Sandybridge architecture looks like:


Assuming that the path from physical disk to server CPU involves a SAN, this IO path involves the hardware components below, each of which has to be specified adequately in order to avoid CPU core starvation:


The best case scenario for the throughput of a 15K RPM SAS drive performing sequential reads is 120Mb/s, assuming a CPU has 6 Westmere cores, it would take ( 12 x 347 ) / 120 = 35 disks to avoid CPU core starvation at the socket level. An Emulex 16Gb LPe 16000 Fibre channel HBA ( single port ) is capable of delivering up to 1600 Mb/s, assuming the dual port LPe 16002 can deliver twice this, the available bandwidth now becomes 3200 Mb/s, therefore two of these host bus adapters would be required to keep my server busy, assuming that the SAN involved can deliver around 4164 Mb/s in bandwidth. The Haswell micro architecture is expected to deliver a 10~15% increase in data consumption rate over the Sandybridge micro architecture.

As the available CPU capacity is still not being fully used, I tried the same simple SELECT COUNT test again with row compression, page compression and also when the hash partitioned heap is turned into a clustered index. Here are the results in full, including what happens if we create a clustered column store index ( added for the purposes of completeness ):

scan rates

The column store index scan is blindingly fast, but you have to put this into context by taking into account the 1 hour and 15 minutes taken to create it in the first place.

If we were performing any operation on the columns other than a COUNT, CPU capacity would be exhausted and the extra demand placed on them by decompression would send performance backwards.

The difference between a heap and a clustered index is one of the most fundamental concepts in the world of SQL Server, this therefore warrants some deeper investigation.

This is the stack trace for the SELECT COUNT query that uses the heap with 24 partitions:

heap scan stacktrace

The total weight for SQL Server, weight being the time sampled across all CPU cores in ms, is 440,627, the weight for the  full table scan is 399,515, the last line I have highlighted is for the filter manager, this has a weight of 63,584, 14.5% of the total weight for SQL Server. The filter manager providers a frame-work which filter mini drivers can hook into, one of the main use cases for this is anti-virus software. I will cover this off in a later post in this series, suffice it to say that anti-virus software is not the cause of the filter manager activity here. I have also highlighted a line which relates to the buffer pool, I will explain the reason behind this by taking a slight diversion. In order to make any piece of software run fast, several things are desirable:

  • Use of a CPU with a fast clock and good single threaded performance.
  • As few last level cache misses as possible, last level cache misses also known as memory stalls can cost upwards of 160 CPU cycles.
  • Minimal context switching, a context switch can cost upwards of 10,000 CPU cycles.
  • The use of SIMD instructions.
  • Cache line align memory structures, i.e. size data structures such that they are multiple of 64 bits, this helps to avoid split register operations.
  • Avoid branch mis-prediction, these can cost around 50 CPU cycles.
  • Finally minimise code path lengths.

I’ve highlighted the line with the call to sqlmin.dll!BPool::ReadAhead because I have a sneaking suspicion that reading from the ioDrive as an extension of the buffer pool might involve a shorter code path than that involved in accessing it as primary storage. I intend to cover this is a future blog post.

Below is the stack trace for the SELECT COUNT query on clustered index, as with the heap it is hash partitioned with 24 partitions

clust index scan stacktrace

The total weight for SQL Server is 668,231, the weight for the part of the stack from sqlmin.dll!CQScanNew::GetRow onwards is 466,771. The second line I have highlighted ( from the top ) is for sqlmin.dll!IndexRowScanner::MoveKeyOrderToRowOnNextPage, this is not present in the stack trace for the heap scan, it has a weight on its own of 56,791. Finally I’ve also highlighted the lines in the stack trace associated with buffer pool access and filter manager activity. Note that the total weight for SQL Server activity relating to the clustered index scan query is significantly more than that for the heap scan at 668,231 versus 440,627. The weight for the clustered index scan itself in isolation is 466,771 compared to 399,515 for the heap scan.

To summarise this first blog post, we have covered:

  • The use of hash partitioning to reduce
  • The cost of examining the contents of a tables columns has been quantified and a potential solution to this as put forwards by Joe Chang highlighted.
  • How to combine the fast ioDrive2 duo card and not so fast SanDisk SSDs via manual file group data file placement in order to increase IO throughput.
  • The IO bandwidths required to avoid core starvation in modern CPUs.
  • The different ways of storing data to optimize scan performance.
  • What happens with a heap scan versus a clustered index scan at a deep level within the database engine.

In the next couple of posts in this series I intend to cover:

  • Whether the buffer pool extension feature provides any extra performance and what the call stack for this looks like.
  • Whether the time being expended in the filter driver can be reduced.
  • The effect of hash partitioning the heap with different numbers of partitions.

One thought on “Super Scaling The SQL Server 2014 Parallel Insert: Part 1, Investigating Scan Performance

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s