Super Scaling The SQL Server 2014 Parallel Insert: Part 3, Scan Performance and Fragmentation

This post continues the series on how I hope to squeeze every last ounce of performance out of my hardware when using SQL Server’s parallel insert feature. Given that my ioDrive and SanDisk SSDs have “on paper” sequential read ratings of 3GB/s and 550MB/s respectively, I was hoping to get the best part of 4GB/s throughput by combining the two types of storage. In a previous post in this series I mentioned that I had striped my data files such that for every six files on the ioDrive I had one on a SanDisk SSD, all within the same filegroup. I based this ratio on anecdotal performance figures, so I thought I should take the opportunity to see whether 6:1 was actually optimal. This is how I create my test data:

;WITH generator AS (
     SELECT     TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
     FROM       (SELECT a = 1
                 FROM     master.dbo.syscolumns) c1
     CROSS JOIN master.dbo.syscolumns c2
)
SELECT  id % 24                    AS Col1
       ,id + 1                     AS Col2
       ,id + 2                     AS Col3
       ,id + 3                     AS Col4
       ,id + 4                     AS Col5
       ,CASE id % 2
            WHEN 1
                THEN AddressLine1
                ELSE AddressLine2
        END                        AS Col6
       ,CASE id % 7
            WHEN 1
                THEN AddressLine1
                ELSE AddressLine2
        END                        AS Col7
       ,FirstName                  AS Col8
       ,MiddleName                 AS Col9
       ,LastName                   AS Col10
       ,CustomerAlternateKey       AS Col11
INTO   BigData
FROM   generator g1
CROSS JOIN DimCustomer

After quantifying the (considerable) cost of the engine having to inspect row values to check for NULLs in the first post of this series, I created a new version of the table with NOT NULL constraints and populated it as follows:

INSERT INTO [dbo].[BigDataHeap24] WITH (TABLOCK)
SELECT  [Col1]
       ,ISNULL([Col6] , 'X')
       ,ISNULL([Col7] , 'X')
       ,ISNULL([Col8] , 'X')
       ,ISNULL([Col9] , 'X')
       ,ISNULL([Col10], 'X')
       ,ISNULL([Col11], 'X')
FROM   [dbo].[BigData] WITH (NOLOCK)
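
The post does not show the definition of [BigDataHeap24]; as a rough sketch (the column data types here are assumptions based on the AdventureWorks DimCustomer columns used to populate it, not the actual DDL used), it would look something like:

```sql
-- Hypothetical sketch of the NOT NULL heap; exact data types are assumptions.
CREATE TABLE [dbo].[BigDataHeap24] (
     Col1  BIGINT       NOT NULL
    ,Col6  NVARCHAR(60) NOT NULL
    ,Col7  NVARCHAR(60) NOT NULL
    ,Col8  NVARCHAR(50) NOT NULL
    ,Col9  NVARCHAR(50) NOT NULL
    ,Col10 NVARCHAR(50) NOT NULL
    ,Col11 NVARCHAR(15) NOT NULL
)
```

The NOT NULL constraints are the point: they allow the engine to skip the per-value NULL checks whose cost was quantified in the first post.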

To eliminate contention on the ACCESS_METHODS_DATASET_PARENT latch, I hash partitioned the BigData heap and rebuilt it with different ratios of data files on the ioDrive to data files on the SanDisk Extreme Pro SSD. Here is my test query for scanning the BigData table, run from a cold buffer cache, followed by the results I get:

SELECT COUNT(*)
FROM   [dbo].[BigData_FIO_PS] WITH (NOLOCK)
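
The timings below were taken from a cold buffer cache; the standard way to reproduce that between runs, and to vary the degree of parallelism for the graph further down, looks something like this (a sketch, not the exact test harness used):

```sql
-- Flush dirty pages, then empty the buffer pool so the scan reads from disk.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM   [dbo].[BigData_FIO_PS] WITH (NOLOCK)
OPTION (MAXDOP 24);   -- vary this to chart elapsed/CPU time per DOP
```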
  • ioDrive / SanDisk SSD data file ratio 5:1 47544 ms

5_1 throughput

  • ioDrive / SanDisk SSD data file ratio 6:1 46869 ms

6_1 throughput

  • ioDrive / SanDisk SSD data file ratio 7:1 43352 ms

7_1 throughput

  • ioDrive / SanDisk SSD data file ratio 8:1 45858 ms

8_1 throughput
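
For reference, SQL Server has no native hash partitioning, so "hash partitioning" a heap is typically done by range-partitioning on a modulus column. A minimal sketch, assuming Col1 (which already holds id % 24) is the partitioning column; the function, scheme, and filegroup names are illustrative assumptions, not my actual layout:

```sql
-- Col1 = id % 24, so a RANGE LEFT function over 0..22 yields 24 partitions
-- that behave like hash buckets and spread the insert across partitions.
CREATE PARTITION FUNCTION pf_hash24 (BIGINT)
AS RANGE LEFT FOR VALUES (0,1,2,3,4,5,6,7,8,9,10,11,
                          12,13,14,15,16,17,18,19,20,21,22);

-- Map every partition onto the striped filegroup (assumed name FG_Striped).
CREATE PARTITION SCHEME ps_hash24
AS PARTITION pf_hash24 ALL TO (FG_Striped);
```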

Taking the sweet spot, a partitioned heap with seven ioDrive data files for each SanDisk SSD data file, this graph illustrates what elapsed time (ms) and CPU time (ms) look like as the degree of parallelism increases:

Scan Times Per Dop

Note the spike in elapsed time as we cross the NUMA boundary. To check that only one socket is utilized at a DOP of 12, let's look at CPU utilization at the logical processor and NUMA node level:

CPU Utilisation by logical processor

CPU Utilisation by NUMA node

Task Manager suggests that SQLOS is scheduling the threads to run on the same NUMA node. To verify that this is definitely the case, I altered the CPU affinity mask so that my instance only used the CPU cores from NUMA node 1, which resulted in the following CPU utilization at the logical processor and NUMA node level:

CPU Utilisation by logical processor NUMA Node 1

CPU Utilisation by NUMA node 1
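
Pinning the instance to a single node can be done with ALTER SERVER CONFIGURATION rather than sp_configure's bitmask; a sketch, assuming node 1 is the target:

```sql
-- Restrict the instance's schedulers to the cores of NUMA node 1.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 1;

-- Revert to the default once the test is done (use all cores).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```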

The scan of the partitioned heap with the 7:1 ioDrive to SanDisk SSD data file ratio has two distinct IO patterns. Firstly, the largest IO size achieved for the ioDrive is 64K:

7_1 io sizes IODrive

For the SanDisk SSD we see 512K reads:

7_1 io sizes SanDisk

Fragmentation is the usual cause of read-ahead suppression, so let's see what this looks like for the heap:

Fragmentation Heap 7_1

Despite the fragmentation being minimal, let's try eliminating it by using the -E start-up parameter, which causes SQL Server to allocate sixty-four extents at a time before switching to the next data file when a filegroup contains more than one data file. Would this make any difference to fragmentation? From what sys.dm_db_index_physical_stats reported, the answer was little or none:

Fragmentation Heap 7_1
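
The fragmentation figures come from sys.dm_db_index_physical_stats; a minimal query of the kind used here (the DETAILED scan mode is an assumption, a SAMPLED scan would also do):

```sql
-- Report logical fragmentation for each partition of the heap (index_id 0).
SELECT  partition_number
       ,avg_fragmentation_in_percent
       ,fragment_count
       ,page_count
FROM    sys.dm_db_index_physical_stats(
            DB_ID(), OBJECT_ID('dbo.BigData_FIO_PS'), 0, NULL, 'DETAILED');
```

For a heap, avg_fragmentation_in_percent reports extent fragmentation rather than logical (page-order) fragmentation.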

The whole point of this exercise was to ascertain why I was not getting more 512K reads and, more importantly, whether the conventional wisdom and lore of optimizing IO throughput for the old world of spinning disks applies to the new world of flash.


To put to bed once and for all the issue of whether getting 512K reads on the ioDrive matters, I created a heap that resided 100% on the ioDrive. This is the throughput I got when I scanned it using my SELECT COUNT( . . . query:


The maximum read IO size I got was 64K; given that the ioDrive has a maximum rated read throughput of 3GB/s, this proves conclusively that we do not need to worry about read-aheads for the ioDrive.

However, what about the SanDisk solid state drives? Let's create the same heap across the two SanDisk drives, the rationale being that I know for sure that the volumes residing on these are not fragmented. Each partition resides in its own filegroup comprising a single data file, presized to avoid any external fragmentation. This resulted in these IO sizes:

Sandisk IO sizes
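
The one-filegroup-per-partition layout with a single presized file looks something like this (the database, filegroup, path, and size here are illustrative assumptions):

```sql
-- Illustrative: one filegroup per partition, one presized file per filegroup.
ALTER DATABASE [ScratchDB] ADD FILEGROUP [FG_P01];

ALTER DATABASE [ScratchDB]
ADD FILE (
     NAME       = N'P01'
    ,FILENAME   = N'E:\SqlData\P01.ndf'   -- volume on the SanDisk SSD
    ,SIZE       = 8GB                     -- presized up front
    ,FILEGROWTH = 0                       -- no autogrow, no growth fragmentation
) TO FILEGROUP [FG_P01];
```

Presizing with autogrow disabled is what guarantees the file's extents are laid down contiguously rather than interleaved by incremental growth events.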

This is what the CPU and elapsed times obtained via SET STATISTICS TIME ON look like for this scan:

CPU time = 206453 ms, elapsed time = 298584 ms.
CPU time = 207532 ms, elapsed time = 298401 ms.
CPU time = 207218 ms, elapsed time = 298362 ms.
CPU time = 207659 ms, elapsed time = 298576 ms.
CPU time = 208155 ms, elapsed time = 298513 ms.

With trace flag 652 on, which disables read-aheads, we get:

CPU time = 312155 ms, elapsed time = 431012 ms.
CPU time = 315893 ms, elapsed time = 425290 ms.
CPU time = 299375 ms, elapsed time = 425108 ms.
CPU time = 296050 ms, elapsed time = 432489 ms.
CPU time = 300702 ms, elapsed time = 430475 ms.

To make doubly sure that trace flag 652 was working as advertised, these are the IO sizes obtained with it on:

Sandisk IO sizes TF652
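
Trace flag 652 can be flipped on instance-wide (the -1 argument) for a run and then off again:

```sql
-- Disable read-ahead scans instance-wide, verify, run the test, re-enable.
DBCC TRACEON (652, -1);
DBCC TRACESTATUS (652);   -- confirm the flag really is active
-- ... run the scan here ...
DBCC TRACEOFF (652, -1);
```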

This raises the question: which striping scheme is best, and does it actually matter?


The scheme on the left is the one currently in use, and the second graph below shows the IO throughput profile we get using it. Note the part of the graph I have highlighted where the throughput tails off; the scan of the table takes 56 seconds:

tail end

What happens if we try the second striping scheme? With both types of storage combined we get a much more consistent throughput of around 3400MB/s, and the elapsed time for the scan is now 26 seconds:


I suspect that if the motherboard on my Dell T5500 provided native support for SATA 6Gb/s (instead I have to use a SATA 6Gb/s adapter card) I would see better IO throughput than this. However, because I'm seeing better throughput than most people will get from their SANs, I'm not going to complain too much ;-). Let's take a look at the IO sizes we are now getting; the ioDrive figures are on the left and the SSD figures on the right:

more_512K

The answer to the question "Is there value in pursuing read-ahead with flash?" is not a straight yes/no one. For flash packaged up to use the standard SATA interface the answer is yes, and I suspect that this applies equally to SAS drives. For Fusion-io cards (and cards using NVMe) the answer is potentially no; I hope to explore this in greater detail in my next post. I could write an entire blog post on NVMe (non-volatile memory express) were it not for the fact that Anandtech have already done a very good job of this. Fusion-io were ahead of the game in realising that standard interfaces and the driver stack held flash storage back; to solve this problem they effectively produced their own bespoke version of the NVMe IO stack. I suspect that this IO stack, designed from scratch to leverage flash storage, is a factor behind why 64K reads give us the best possible IO throughput for the heap scan. The following illustrates the efficiency in CPU cycles of the NVMe driver stack on Linux versus the traditional stack which SAS and SATA drives use (reproduced from the Anandtech article); it is not Windows and not Fusion-io's virtual storage layer, but it still provides a good illustration:


Using the Windows Performance Toolkit I can perform a "comparative analysis" of the call stacks of my test query at a DOP of 24: a heap scan against the SATA 6Gb/s SSDs versus a heap scan against the ioDrive. To create a level playing field I have used trace flag 652 to force all reads to be 8K in size. This is what the comparative analysis gives me:


The fourth column represents 'Weight', in other words sampled CPU time (ms) across all CPU cores; it contains negative numbers because of the differences in CPU utilization between the two call stacks. Let me pick some key things out of this:

  • storport.sys
    This is the storage port driver. I would wager that the ioDrive is not using this at all, whereas my two SATA SSDs are using it via the SATA 6Gb/s adapter card.
  • storahci.sys
    This is the "Advanced Host Controller Interface" (AHCI) driver. AHCI is the protocol used to communicate across the SATA interface. As the ioDrive is not a SATA device, it should come as no surprise that the test using the ioDrive has burned no cycles at all in this driver.

In the next post in the series I will aim to cover the differences in IO when performing a scan on an SSD versus the ioDrive in more detail; I also hope to find out whether allocation sizes other than 64K on the ioDrive affect performance.
