Following a comment I received on Twitter from @sqlcrossjoin, I would like to focus on why a clustered index scan outperformed a heap scan for my test data. I will use the Windows Performance Analysis toolkit to investigate this in detail, but first I will recap how my storage is set up:
My Fusion IO ioDrive2 duo card appears to the operating system as two disks; as the name ‘duo’ suggests, it is effectively two flash PCIe cards sandwiched together. The two 480GB SanDisk Extreme Pro drives appear as drives 2 and 3, the E and C drives respectively. This is how the source data is created:
;WITH generator AS (
    SELECT TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM (SELECT a = 1 FROM master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT g1.id % 24 AS Col1
      ,g1.id + 1 AS Col2
      ,g1.id + 2 AS Col3
      ,g1.id + 3 AS Col4
      ,g1.id + 4 AS Col5
      ,CASE g1.id % 2 WHEN 1 THEN AddressLine1 ELSE AddressLine2 END AS Col6
      ,CASE g1.id % 7 WHEN 1 THEN AddressLine1 ELSE AddressLine2 END AS Col7
      ,FirstName AS Col8
      ,MiddleName AS Col9
      ,LastName AS Col10
      ,CustomerAlternateKey AS Col11
INTO BigData
FROM generator g1
CROSS JOIN DimCustomer
I have a heap which is hash partitioned across 24 partitions and a clustered index partitioned in the same way. In order to rule out the effects of external fragmentation, I have created each object in its own file group.
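The post does not show the partitioning DDL; a minimal sketch of what a 24-way hash partitioning scheme on Col1 might look like is below. The object and file group names are hypothetical, not the ones used in the test rig:

```sql
-- Hypothetical names throughout; the post does not show the actual DDL.
-- 23 boundary values with RANGE LEFT produce 24 partitions, one per
-- possible value of Col1 (= id % 24, i.e. 0 through 23).
CREATE PARTITION FUNCTION pf_Hash24 (int)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                          12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22);

-- Every partition of the clustered index lives in one dedicated file
-- group, so nothing else can interleave with it on disk.
CREATE PARTITION SCHEME ps_Hash24_CI
AS PARTITION pf_Hash24 ALL TO ([FG_BigData_CI]);

CREATE CLUSTERED INDEX ccsi_BigData
ON dbo.BigDataHashPart24_CI (Col1)
ON ps_Hash24_CI (Col1);
```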
These are my two test queries, which I run against a cold buffer cache:
SELECT COUNT([Col1])
      ,COUNT([Col3])
      ,COUNT([Col4])
      ,COUNT([Col5])
      ,COUNT([Col6])
      ,COUNT([Col7])
      ,COUNT([Col8])
      ,COUNT([Col9])
      ,COUNT([Col10])
      ,COUNT([Col11])
FROM [dbo].[BigDataHashPart24_CI]
SELECT COUNT([Col1])
      ,COUNT([Col2])
      ,COUNT([Col3])
      ,COUNT([Col4])
      ,COUNT([Col5])
      ,COUNT([Col6])
      ,COUNT([Col7])
      ,COUNT([Col8])
      ,COUNT([Col9])
      ,COUNT([Col10])
      ,COUNT([Col11])
FROM [dbo].[BigDataHashPart24_CI]
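Each timed run starts from a cold buffer cache. The post does not show the flushing step, but the usual pattern on a test box (never on production) is:

```sql
-- Flush dirty pages to disk first so DROPCLEANBUFFERS can evict everything.
CHECKPOINT;
-- Evict all clean pages from the buffer pool; the next scan must hit disk.
DBCC DROPCLEANBUFFERS;
```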
Averaged over ten executions from a cold buffer cache, the query against the clustered index takes 40,377 ms to run; the same query against the heap averages 45,842 ms per execution. This is what the distribution of rows per worker thread looks like for the heap scan:
and this is the row distribution per worker thread for the clustered index scan:
This is the wait activity for a single execution of the query against the heap; I have used an extended event to isolate the wait activity for the execution of the test query:
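The extended event session itself is not shown in the post; a minimal sketch that captures wait_info for a single session might look like this (the session name, file name and session id are placeholders):

```sql
CREATE EVENT SESSION [IsolateQueryWaits] ON SERVER
ADD EVENT sqlos.wait_info (
    -- Restrict capture to the spid running the test query.
    WHERE sqlserver.session_id = 55
)
ADD TARGET package0.event_file (SET filename = N'IsolateQueryWaits');

ALTER EVENT SESSION [IsolateQueryWaits] ON SERVER STATE = START;
-- ...run the test query from session 55, then stop and aggregate the file:
ALTER EVENT SESSION [IsolateQueryWaits] ON SERVER STATE = STOP;
```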
and this is what the wait activity looks like for the clustered index scan:
My first reaction on seeing these results was that I should double-check them, which is exactly what I did; they stand up to scrutiny. At face value the clustered index scan is driving the IO subsystem twice as hard as the heap scan. I used SET STATISTICS IO ON to get an initial grasp of what sort of IO was taking place:
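For reference, the statistics were gathered along these lines; the abbreviated query below stands in for the full test query:

```sql
SET STATISTICS IO ON;
-- The messages tab now reports logical reads, physical reads and
-- read-ahead reads for each table touched by the statement.
SELECT COUNT([Col1]) FROM [dbo].[BigDataHashPart24_CI];
SET STATISTICS IO OFF;
```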
The 48 physical reads this reports for the clustered index scan from a cold buffer cache did not inspire me with confidence, so I went back to the Windows Performance Toolkit. Before diving in, I should issue a word of caution: the tool is generally very lightweight, however I found that following the advice in this blog post throttled throughput from 3,300MB/s to 200MB/s.
For the heap scan we can see 68,736 ReadFileScatter system calls being made. As an aside, Bob Ward’s deck on SQL Server IO internals is recommended reading on this subject:
The number of calls to ReadFileScatter is significantly higher for the clustered index scan: 107,434 to be precise:
The most critical performance metric for sequential scans is IO throughput. This is the IO throughput for the heap scan:
and this is what the throughput looks like for the clustered index scan:
We can see that apart from the first 3.5 seconds of the clustered index scan, the IO throughput being delivered is virtually constant at around 3,300MB/s, unlike the throughput for the heap scan. Let’s now look at the composition of the IO throughput by breaking it down into counts per IO size; this is for the heap scan:
and now for the clustered index scan:
The variance in IO sizes for the clustered index scan is significantly less than that for the heap scan and is skewed towards 64K. Can we do better by forcing the IO size? There are two options for this. Firstly, the Microsoft white paper “Tuning Options for SQL Server when running in high performance workloads” states that read-aheads can be disabled:
Trace flag 652: Disable page pre-fetching scans
Trace flag 652 disables page pre-fetching during scans. You can turn on trace flag 652 at startup or in a user session. When you turn on trace flag 652 at startup, the trace flag has global scope. When you turn on trace flag 652 in a user session, the trace flag has session scope. If you turn on trace flag 652, SQL Server no longer brings database pages into the buffer pool before these database pages are consumed by the scans. If you turn on trace flag 652, queries that benefit from the page pre-fetching feature exhibit low performance.
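For reference, the trace flag can be switched on and off without a restart, at either session or global scope:

```sql
DBCC TRACEON (652);        -- session scope
DBCC TRACEON (652, -1);    -- global scope
DBCC TRACESTATUS (652);    -- confirm it is active
DBCC TRACEOFF (652, -1);   -- revert the global setting
```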
Paul White (@SQL_Kiwi) has mentioned the limitations of this trace flag with SQL Server 2012 in response to a DBA Stack Exchange question here. Let’s see if the situation has changed with SQL Server 2014 CU2. With trace flag 652 enabled, this is what the IO throughput looks like for the heap scan:
and these are the sizes of the reads:
It appears that trace flag 652 does work for heap scans. The theory behind the trace flag is that if you can read the data fast enough, it’s a waste of CPU cycles to perform read-aheads; however, based on the results here, the sweet spot for IO throughput is closer to 64K than 8K.
The importance of putting these results into the context of the test hardware cannot be overstated. If the hardware included some form of storage array with a large cache, the likelihood is that it would favour aggressive read-aheads, which could reverse these results and make the heap scan faster.
The second option for influencing IO sizes is to encourage more aggressive read-ahead activity. The -E start-up parameter, documentation for which can be found here, causes the database engine to allocate four extents per data file at a time, the rationale being that this should encourage more read-ahead-friendly extent allocation. I therefore rebuilt the heap and clustered index with this start-up parameter in use; below are the IO throughputs and IO sizes by count for the heap scan:
and the clustered index scan:
The use of the -E flag results in marginally better performance for both the clustered index scan and the heap scan.
In conclusion, the clustered index scan outperforms the heap scan because it results in IO sizes conducive to getting consistently high IO throughput from the hardware.
In the next post in this series I will look at the effect of the number of partitions on scan performance.
I know from experience that SQL Server will issue full 512KB IOs on both clustered indexes and heaps in all cases if it can.
The IO patterns in your case suggest that there is fragmentation. Why else would SQL Server needlessly split IOs? You should check this.
I would not expect any difference in IO throughput if no fragmentation is present. If any is present, specifying NOLOCK helps with the clustered index because it causes an allocation-order scan.
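For context, the allocation-order-scan suggestion amounts to adding the hint to the test query, along these lines:

```sql
-- Under read uncommitted, the storage engine may scan the clustered
-- index in allocation (IAM) order rather than logical key order,
-- which is insensitive to logical fragmentation.
SELECT COUNT([Col1])
FROM [dbo].[BigDataHashPart24_CI] WITH (NOLOCK);
```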
Yes, there is some fragmentation present, and I’m trying to get a handle on why. Both the heap and the clustered index have partitioning schemes associated with completely different file groups in which nothing else resides. Also, the two file groups were created with data files pre-sized at 10GB; after the two objects were populated, none of the data files had grown in size, which rules out external fragmentation. As for internal fragmentation, the heap was created with an INSERT SELECT, which is single threaded in nature; the clustered index is sorted on Col1, which contains an even distribution of values from 0 through 23, and my partitioning function maps each value to its own partition, so there should not be any internal fragmentation. I will, however, double-check my results to ensure I have not dropped the ball anywhere and add an addendum to this post, as I am more concerned with my results being correct than with being proven wrong.
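The standard way to quantify the fragmentation in question is sys.dm_db_index_physical_stats; a sketch against the clustered index from the post (the heap can be checked the same way by passing its object id):

```sql
SELECT ips.partition_number,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.fragment_count,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(),
         OBJECT_ID(N'dbo.BigDataHashPart24_CI'),
         NULL, NULL, 'LIMITED') AS ips
ORDER BY ips.partition_number;
```

LIMITED mode only reads the IAM and balanced-tree levels above the leaf, so it is cheap enough to run against large tables.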
From this description I can’t tell what’s wrong either. Sounds good.
You could map each partition to its own file group. Whatever is causing fragmentation here, hopefully that kills it.
SQL Server is awfully prone to fragmentation without any good reason. It could simply extend tables in contiguous chunks of 16MB or so.
Creating or rebuilding indexes with a MAXDOP of 1 helps; it’s well documented that doing this with any DOP higher than 1 can cause fragmentation. Also, as suggested, I’m going to have one file group per partition. I have a strange feeling, which is yet to be substantiated and I may be proven wrong, that fragmentation might work for you (me) with this type of storage. Again, I would caveat anything I do with the statement that taking the results in the context of the test hardware is of paramount importance.
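The single-threaded rebuild mentioned above would look something like this; the index name is hypothetical, as the post does not show it:

```sql
-- A serial rebuild avoids the per-thread extent interleaving that a
-- parallel rebuild can introduce, at the cost of a longer rebuild time.
ALTER INDEX ccsi_BigData
ON dbo.BigDataHashPart24_CI
REBUILD WITH (MAXDOP = 1);
```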