Following a comment I received on Twitter from @sqlcrossjoin, I would like to focus on why a clustered index scan outperformed a heap scan for my test data. I will use the Windows Performance Toolkit to investigate this in detail, but first I will recap how my storage is set up:
My Fusion IO ioDrive2 Duo card appears to the operating system as two disks; as the name 'Duo' suggests, it is effectively two flash PCIe cards sandwiched together. The two 480GB SanDisk Extreme Pro drives appear as disks 2 and 3, the E and C drives respectively. This is how the source data is created:
```sql
;WITH generator AS (
    SELECT TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM   (SELECT a = 1 FROM master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT  g1.id % 24 AS Col1
       ,g1.id + 1  AS Col2
       ,g1.id + 2  AS Col3
       ,g1.id + 3  AS Col4
       ,g1.id + 4  AS Col5
       ,CASE g1.id % 2 WHEN 1 THEN AddressLine1 ELSE AddressLine2 END AS Col6
       ,CASE g1.id % 7 WHEN 1 THEN AddressLine1 ELSE AddressLine2 END AS Col7
       ,FirstName            AS Col8
       ,MiddleName           AS Col9
       ,LastName             AS Col10
       ,CustomerAlternateKey AS Col11
INTO    BigData
FROM    generator g1
CROSS JOIN DimCustomer
```
I have a heap that is hash partitioned across 24 partitions, and a clustered index partitioned in the same way. To rule out the effects of external fragmentation, I have created each object in its own filegroup.
These are my two test queries, which I run against a cold buffer cache:
```sql
SELECT  COUNT([Col1])
       ,COUNT([Col2])
       ,COUNT([Col3])
       ,COUNT([Col4])
       ,COUNT([Col5])
       ,COUNT([Col6])
       ,COUNT([Col7])
       ,COUNT([Col8])
       ,COUNT([Col9])
       ,COUNT([Col10])
       ,COUNT([Col11])
FROM    [dbo].[BigDataHashPart24_CI]
```
```sql
SELECT  COUNT([Col1])
       ,COUNT([Col2])
       ,COUNT([Col3])
       ,COUNT([Col4])
       ,COUNT([Col5])
       ,COUNT([Col6])
       ,COUNT([Col7])
       ,COUNT([Col8])
       ,COUNT([Col9])
       ,COUNT([Col10])
       ,COUNT([Col11])
FROM    [dbo].[BigDataHashPart24_Heap]
```
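To make each run start from a cold buffer cache, the standard approach is to flush dirty pages and then empty the buffer pool between executions; a minimal sketch (not necessarily the exact commands used for these tests):

```sql
CHECKPOINT;              -- flush dirty pages to disk so they can be dropped
DBCC DROPCLEANBUFFERS;   -- remove all clean pages from the buffer pool
```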
Averaged over 10 executions against a cold buffer cache, the query against the clustered index takes 40,377 ms to run; the same query against the heap averages 45,842 ms per execution. This is what the distribution of rows per worker thread looks like for the heap scan:
and this is the row distribution per worker thread for the clustered index scan:
This is the wait activity for a single execution of the query against the heap; I have used an extended event session to isolate the wait activity for the execution of the test query:
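For reference, wait activity can be scoped to a single session with an event session along these lines; the session name, session_id and file path here are illustrative, not necessarily what was used for these tests:

```sql
CREATE EVENT SESSION [QueryWaits] ON SERVER
ADD EVENT sqlos.wait_info (
    WHERE sqlserver.session_id = 55   -- the session running the test query
)
ADD TARGET package0.event_file (SET filename = N'C:\Temp\QueryWaits.xel');

ALTER EVENT SESSION [QueryWaits] ON SERVER STATE = START;
```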
My first reaction on seeing these results was that I should double-check them, which is exactly what I did, and they do stand up to scrutiny. At face value the clustered index scan is driving the IO subsystem twice as hard as the heap scan. I used SET STATISTICS IO ON to get an initial picture of what sort of IO was taking place:
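The pattern for capturing these counters is simply to enable the setting before running the test query; the figures then appear in the Messages tab:

```sql
SET STATISTICS IO ON;
-- run the test query here; logical reads, physical reads and
-- read-ahead reads are then reported per table
SET STATISTICS IO OFF;
```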
The 48 physical reads this reports for the clustered index scan from a cold buffer cache did not inspire me with confidence, so I went back to the Windows Performance Toolkit. Before diving in, I should issue a word of caution: the tool is generally very lightweight, but I found that following the advice in this blog post throttled throughput from 3,300MB/s to 200MB/s.
For the heap scan we can see that 68,736 ReadFileScatter system calls are being made. As an aside, Bob Ward's deck on SQL Server IO internals is recommended reading on this subject:
The number of ReadFileScatter calls is significantly higher for the clustered index scan; 107,434 to be precise:
The most critical metric for sequential scan performance is IO throughput. This is the IO throughput for the heap scan:
and this is what the throughput looks like for the clustered index scan:
We can see that, apart from the first 3.5 seconds of the clustered index scan, the IO throughput being delivered is virtually constant at around 3,300MB/s, unlike the throughput for the heap scan. Let's now look at the composition of the IO throughput by breaking it down into counts per IO size; this is for the heap scan:
and now for the clustered index scan:
The variance in IO sizes for the clustered index scan is significantly less than that for the heap scan, and skewed towards 64K. Can we do better by forcing the IO size? There are two options for this. Firstly, the Microsoft white paper "Tuning Options for SQL Server when running in high performance workloads" states that read-aheads can be disabled:
Trace flag 652: Disable page pre-fetching scans
Trace flag 652 disables page pre-fetching during scans. You can turn on trace flag 652 at startup or in a user session. When you turn on trace flag 652 at startup, the trace flag has global scope. When you turn on trace flag 652 in a user session, the trace flag has session scope. If you turn on trace flag 652, SQL Server no longer brings database pages into the buffer pool before these database pages are consumed by the scans. If you turn on trace flag 652, queries that benefit from the page pre-fetching feature exhibit low performance.
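For a test like this, the flag can be enabled globally with DBCC TRACEON; for example:

```sql
DBCC TRACEON (652, -1);   -- -1 gives the flag global scope
DBCC TRACESTATUS (652);   -- confirm it is active
-- ... run the scan tests ...
DBCC TRACEOFF (652, -1);
```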
Paul White (@SQL_KIWI) has mentioned the limitations of this trace flag with SQL Server 2012 in response to a DBA Stack Exchange question here. Let's see whether the situation has changed with SQL Server 2014 CU2. With trace flag 652 enabled, this is what the IO throughput looks like for the heap scan:
and these are the sizes of the reads:
It appears that trace flag 652 does work for heap scans. The theory behind the flag is that if you can read the data fast enough, it is a waste of CPU cycles to perform read-aheads; however, based on the results here, it would seem that the sweet spot for IO throughput is closer to 64K than 8K.
The importance of putting these results into the context of the test hardware cannot be overstated. If the hardware included some form of storage array with a large cache, the likelihood is that it would favour aggressive read-aheads, which could reverse these results and make the heap scan faster.
The second option for influencing IO sizes is to encourage aggressive read-ahead activity. The -E start-up parameter, documentation for which can be found here, causes the database engine to allocate four extents per data file at a time, the rationale being that this should encourage more read-ahead-friendly extent allocation. I therefore rebuilt the heap and the clustered index with this start-up parameter in use. Below are the IO throughputs and IO sizes by count for the heap scan:
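As a configuration note, -E is specified as a SQL Server start-up option, either persistently via Configuration Manager or, for a one-off test, on the command line when starting the service; for example, assuming the default instance name:

```
net start MSSQLSERVER /E
```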
and the clustered index scan:
The use of the -E flag results in marginally better performance for both the clustered index scan and the heap scan.
In conclusion, the clustered index scan outperforms the heap scan because it results in IO sizes conducive to achieving consistently high IO throughput from the hardware.
In the next post in this series I will look at the effect of the number of partitions on scan performance.