In this post I will look at the parallel insert in whole and in particular the affect that bulk extent allocation can have.
Here are the key takeaway points from the series so far:
- Takeaway #1 Hash partitioning
Performance of a straight heap scan is throttled by LATCH_EX waits on the parallel page supplier / range enumerator, the resolution to this is to hash partition the heap.
- Takeaway #2 Cost of column value inspection
The cost of inspecting column values is significant.
- Takeaway #3 Manual striping
A manual file group striping scheme of seven files on the ioDrive to one file on a conventional SSD resulted in the best scan throughput of 3400MB/s.
- Takeaway #4 Not all flash storage is packaged equally
A maximum IO throughput of 3GB/s was obtained on the ioDrive with 64KB reads, whilst obtaining the nest throughput on the conventional SSD required 512KB reads. I attribute this to the ioDrive having a completely different driver architecture that allows IO to be issued in parallel to the degree the maximum IO throughput of the device can be achieved with 64KB reads.
I was going to investigate the impact of allocation size on ioDrive read performance, however as I’ve been able to get the see full bandwidth of 3GB/s out of the card with an allocation unit size of 64KB, this has dropped down my list of things in terms of urgency.
This is the test hardware and the SQL used to generated the source data for the parallel insert:
At the time of writing the only way to leverage the parallel insert in SQL Server 2014 is via a SELECT INTO statement, I hope that this will be extended to include the INSERT SELECT statement in the next major version.
The ‘Manual’ striping scheme in use is the one on the right:
Lets test the parallel insert in its entirety, the destination file group has 48 datafiles, with a ratio of seven files on the IO drive to each file on one of the SSDs. The database recovery model is bulk logged:
The lowest elapsed time is obtained with a degree of parallelism of 18. Changing the striping scheme for the source table file group resulted in 512KB reads on the conventional SSDs and the best scan IO throughput, could forcing the IO size work for the INSERT side of things also ?.
The ‘E’ start up parameter causes the database engine to allocate 64 extents at a time before switching to the next file in the file group. The E parameter is mainly intended for use with data warehouses, in that contiguous extent allocation is conducive to read-a-heads during large scans. This is the effect it has on the parallel insert:
Performance, has gone backwards, the lowest elapsed time is now for a degree of parallelism 22.
Does the number of files in the destination file group make a difference ?
Something interesting happens when we increase the number of files in the file group the table is created in from 48 to 96:
This is the best result yet, the lowest elapsed time is for a degree of parallelism of 14 !. Why does a destination file group with 96 files instead of 48 make a difference ?. Every so often different files will experience “Latency spikes” whilst their contents are flushed to disk, more files increase the likelihood that a file being written to it is not subject to a latency spike:
Takeaway #5
When using solid state storage, there is a “Sweet spot” for the number of files in the file group being written to.
Whilst performing my tests with 96 file destination file group, I also tested a number of different configurations:
- ‘Baseline’
Only trace flag 1117 is enabled - E
E start-up parameter and trace flag 1117. - E, TF1118
E start-up parameter and trace flags 1117 and 1118. - TF1118
Trace flags 1117, 1118 enabled and no ‘E’ start-up parameter.
Takeaway #6
When used together; the E start-up parameter, trace flag 1118 and a destination file group with 96 files, result in the lowest elapsed times for the most degrees of parallelism tested.
Bulk Extent Allocation Deep Dive
The Windows Performance Analyzer (part of the windows performance toolkit) “Computation table” can help quantify the CPU expended by the entire database engine, all the statistics are for a degree of parallelism of 14:
We can see the exactly what is taking place during extent allocation:
- Now with the E start-up parameter in use:
That same code path is used for both cases, the only noticeable difference is a minor drop in CPU consumption when the E start-up parameter is in use.
Do wait statistics reveal anything ?:
- Wait statistics when the E start-up parameter is not in use:
- Wait statistics when the E start-up parameter is in use:
Allocating 64 extents at a time (instead of just 1) will cause intense bursts of activity on the space management bit (and bytes) maps, which is why more time is spent waiting on PAGIOLATCH_UP when the E start-up parameter is in play.
What about spin lock activity ?:
- With No E start-up parameter:
- E start-up parameter
The LOGCACHE_ACCESS spin lock controls access to the log buffer:
Is the LOGCACHE_ACCESS spin lock being held for longer when the E start-up parameter is not used because the volume of logging information copied into the log buffer is greater ?. Mining the transaction log via fn_dblog might reveal some clues:
For completeness, here are the counts per log operation and context:
Takeaway #7
With the E start-up parameter the total log record size for operations on the GAM and IAM bit maps and PFS byte map is significantly reduced !.
Finally lets take a look at IO throughput:
- IO Throughput with the E start-up parameter:
- IO throughput with the E start up parameter:
The E start up parameter results in IO throughput which is much more even and consistent than the throughput we get without it.
There is still more performance I can squeeze out of the parallel insert using techniques I have not covered yet in this series of post which I hope to blog about in the next post in the series.
It is so shameful that SQL Server cannot max out a powerful IO subsystem. All of the problems found here are correctable by the SQL Server team developers. The E startup flag should not be needed at all for example. SQL Server should just adapt to high rates of allocation.
There are some more tricks I have up my sleeve the squeeze some more performance out of the parallel insert. On the subject of what the SQL Server development team could do, however I would caveat this by saying that their hands are probably tied by the sales and marketing people, where do you want to start with regard to missed opportunities:
– The fact that parallel query execution is not scale-able, apparently some of the code to handle threading on the parallel iterators is horrid.
– Every tier one vendor that has a column store database leverages SIMD, Microsoft are yet to do this with SQL Server.
– The amount of caveats and missing features there are with Hekaton makes a strong case for it being a proof of concept.
– In my singleton insert blog post I have demonstrated that for a simple insert in a loop there is a significant overhead in T-SQL language processing (the time spent in sqllang.dll). Oracle allows all PL/SQL code to be natively compiled with just a single parameter change, something similar needs to be furnished across the WHOLE SQL server engine and not just Hekaton
– etc etc etc
Chris
Correct. We both could add a few more items here…