Super Scaling The SQL Server 2014 Parallel Insert: Part 2, Clustered Index Scans Versus Heap Scans

Following a comment I received on twitter from @sqlcrossjoin I would like to focus on why a clustered index scan outperformed a heap scan for my test data. I will use the Windows Performance Analysis tool kit to investigate this in detail, but first I will recap on how my storage is set up:

disksMy Fusion IO ioDrive2 duo card appears to the operating system as two disks, as the name ‘duo’ suggests; it is effectively two flash PCIe cards sandwiched together. The two 480Gb SanDisk Extreme Pro drives appear as drives 2 and 3, the E and C drives respectively. This is how the source data is created:

;WITH generator AS (
     SELECT     TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
     FROM       (SELECT a = 1
                 FROM     master.dbo.syscolumns) c1
     CROSS JOIN master.dbo.syscolumns c2
)
SELECT  g1.id % 24                AS Col1
       ,g1.id + 1                 AS Col2
       ,g1.id + 2                 AS Col3
       ,g1.id + 3                 AS Col4
       ,g1.id + 4                 AS Col5
       ,CASE g1.id % 2
            WHEN 1
                THEN AddressLine1
                ELSE AddressLine2
        END                       AS Col6
       ,CASE g1.id % 7
            WHEN 1
                THEN AddressLine1
                ELSE AddressLine2
        END                       AS Col7
       ,FirstName                 AS Col8
       ,MiddleName                AS Col9
       ,LastName                  AS Col10
       ,CustomerAlternateKey      AS Col11
INTO   BigData
FROM   generator g1
CROSS JOIN DimCustomer

I have a heap which is hash partitioned across 24 partitions and a clustered index partitioned in the same way. In order to rule out the affects of external fragmentation, I have created each object in its own file group.

There are my two test queries which I test using a cold buffer cache:

SELECT  COUNT([Col1])
       ,COUNT([Col3])
       ,COUNT([Col4])
       ,COUNT([Col5])
       ,COUNT([Col6])
       ,COUNT([Col7])
       ,COUNT([Col8])
       ,COUNT([Col9])
       ,COUNT([Col10])
       ,COUNT([Col11])
FROM  [dbo].[BigDataHashPart24_CI]
SELECT  COUNT([Col1])
       ,COUNT([Col2])
       ,COUNT([Col3])
       ,COUNT([Col4])
       ,COUNT([Col5])
       ,COUNT([Col6])
       ,COUNT([Col7])
       ,COUNT([Col8])
       ,COUNT([Col9])
       ,COUNT([Col10])
       ,COUNT([Col11])
FROM  [dbo].[BigDataHashPart24_CI]

Averaged out over 10 executions and a cold buffer cache, the query against the clustered index takes 40377 ms to run. The same query against the heap has an average execution time 45842 ms per execution for ten executions. This is what the distribution of rows per worker thread looks like for the heap scan:

heap scan row distrib

and this is the row distribution per worker thread for the clustered index scan:

CI scan row distrib

This is the wait activity for a single execution of the query against the heap, I’ve used an extended event to isolate the wait activity for the execution of the test query:

heap scan non idle waitsand this is what the wait activity looks like for the clustered index scan:

CI scan non idle waitsMy first reaction on seeing these results was that I should double check them, which is exactly what I did and they do stand up to scrutiny. On face value the clustered index scan is driving the IO sub system twice as hard as the heap scan.  I used SET STATISTICS IO ON to obtain an initial grasp as to what sort of IO was taking place:

io stats

The 48 physical reads this reports for the clustered index scan from a cold buffer cache did not inspire me with confidence, therefore I went back to the Windows Performance Toolkit. Before diving into using this, I should issue a word of caution when using this tool, generally it is very light weight, however I found that following the advice in this blog post throttled a throughput of 3300Mb/s to 200Mb/s,

For the heap scan we can see 68,736 ReadFileScatter system calls are being made, as aside, Bob Ward’s deck on SQL Server IO Internals‘s is recommended reading around this subject:

heap scan

The number of calls to this ReadFileScatter is significantly more for the clustered index scan, 107,434 to be precise:

clustered index scan

The most critical key performance metric for sequential scan performance is IO throughput, this is the IO throughput for the heap scan:

IO throughput heap scan

and this is what the throughput looks like for the clustered index scan:

IO throughput CI scan

We can see that apart from the first 3.5 seconds of the clustered index scan, the IO throughput being delivered is virtually constant at around 3300Mb/s unlike the throughput for the heap scan. Lets now have a look at the composition of the IO throughput by breaking it down into counts per IO size, this is for the heap scan:

HeapScan IO sizes

 and now for the clustered index scan:

CI Scan IO sizes

The variance in IO sizes for the clustered index scan is significantly less than that for the heap scan and skewed towards 64K. Can we do better by forcing the IO size ?, there are two options for this, firstly the Microsoft white paper “Tuning Options for SQL Server when running in high performance workloads”; states that read-a-heads can be disabled:

Trace flag 652: Disable page pre-fetching scans

Trace flag 652 disables page pre-fetching during scans. You can turn on trace flag 652 at startup or in a user session. When you turn on trace flag 652 at startup, the trace flag has global scope. When you turn on trace flag 652 in a user session, the trace flag has session scope. If you turn on trace flag 652, SQL Server no longer brings database pages into the buffer pool before these database pages are consumed by the scans. If you turn on trace flag 652, queries that benefit from the page pre-fetching feature exhibit low performance.

Paul White ( @SQL_KIWI ) has mentioned the limitations of this trace flag with SQL Server 2012 in response to a DBA Stack Exchange question here. Lets see if the situation has changed with SQL Server 2014 CU 2, with trace flag 652 enabled, this is what the IO throughput looks like for the heap scan:

TF635_io_throughput

and these are the sizes of the reads:

TF635_io_sizes

It appears that trace flag 652 does work for heap scans. The theory behind trace flag 652 is that if you can read the data fast enough, its a waste of CPU cycles performing read-a-heads, however,  based on the results here, it would seem that the sweet spot for IO throughput is closer to 64K than 8K.

The importance of putting these results into the context of the test hardware cannot be overstated. If the hardware included some form of storage array with a large cache, the likelihood is that it would favour aggressive read-a-heads, which would reverse these results and make the heap scan faster.

The second available option for influencing IO sizes is to encourage aggressive read-a-head activity. The start-up parameter: -E, documentation for which can be found here will cause the database engine to allocate four extents per datafile at a time, the rationale being that this should encourage more read-a-head friendly extent allocation. I therefore rebuild the heap and clustered index with this start-up parameter in use, below are the IO throughputs and IO sizes by count for the heap scan:

HeapScan_E_throughput

HeapScan_E_IO_sizes

and the clustered index scan:

CI Scan E Throughput

CI Scan E IO sizes

The use of the -E flag results in marginally better performance for both the clustered index scan and the heap scan.

In conclusion the clustered index scan out performs the heap scan because it results in IO sizes conducive to getting consistently high IO throughput from the hardware.

In the next post in this series I will look at the effect of the number of partitions on scan performance.

5 thoughts on “Super Scaling The SQL Server 2014 Parallel Insert: Part 2, Clustered Index Scans Versus Heap Scans

  1. I know from experience that SQL Server will issue full 512KB IOs on both CI and heaps in all cases if it can.

    The IO patterns in your case suggest that there is fragmentation. Why else would SQL Server needlessly split IOs? You should check this.

    I would not expect any difference in IO throughput if no fragmentation is present. If any is present specifying NOLOCK helps with the CI because it causes an allocation-order scan.

    • Yes, there is some fragmentation present, I’m trying to get a handle on why this is. Both the heap and clustered index have partitioning schemes which are associated with completely different file groups which nothing else resides in. Also the the two file groups were created with data file pre-sized at 10 Gb, after these two objects were populated none of the data files have grown in size, this rules out external fragmentation. As for internal fragmentation, the heap was created with an INSERT SELECT which is single threaded in nature, the clustered index is sorted on Col1 which contains an even distribution of values from 0 through to 23, my partitioning function maps each value to its own partition, therefore there should not be any internal fragmentation. I will however double check my results to ensure I have not dropped the ball anywhere and add an addendum to this post, as I am more concerned with my results being correct than being proven wrong.

      • From this description I can’t tell either what’s wrong. Sounds good.

        You could map each partition to its own file group. Whatever is causing fragmentation here – hopefully that kills it.

        SQL Server is awfully prone to fragmentation without any good reason. It could simply extend tables in contiguous chunks of 16MB or so.

      • If you create / rebuild indexes with a MAXDOP of 1 that helps, its well documented that doing this with any DOP higher than 1 can cause fragmentation, also as suggested, I’m going to have one file group per partition. I have a strange feeling, which is yet to be substantiated and I may be proven wrong that fragmentation might work you you ( me ) with the type of storage. Again I would caveat anything I do with the statement that you taking the results with the context of the test hardware is of paramount importance.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s