Super Scaling Singleton Inserts

This blog post is one of my longer ones, if not the longest; however, if you read it in its entirety you will find out how to get the elapsed time of 24 concurrent singleton insert streams down from over an hour to 1 minute and 45 seconds (1 minute and 25 seconds if you are flexible about the term ‘singleton’ 😉).

This is my hardware test setup:

  • Dell T5500 with two hexa-core 2.4GHz Xeons (Westmere micro-architecture) with hyper-threading enabled.
  • 32GB of DDR3 1600MHz triple-channel memory.
  • One 2.4TB ioDrive 2 Duo.
  • Two 480GB SanDisk Extreme Pro solid state drives.
  • RML Utilities 64-bit; I’m going to use this to fire off concurrent inserts at the database.
  • Windows Server 2012 R2
  • SQL Server 2014 Enterprise Edition with CU 5

The server hosts a single instance configured as follows:

  • min and max memory set to 28GB
  • Trace flag 1117 specified as a database engine start-up parameter

I should caveat the results in this blog with two important points:

  • If you have a spinning disk based IO subsystem, the results you see are likely to be radically different from those in this blog post.
  • The IO throughput figures I quote are for log writer activity, not for when checkpoints take place.

For the uninitiated, I use the Windows Performance Toolkit a lot in my work: xperf to create Event Tracing for Windows (ETW) files and Windows Performance Analyzer (WPA) to visualize the data these files contain. This mind map illustrates the process:

xperf mind map
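In practice, capturing a trace for one of these test runs looks roughly like the commands below. Treat this as a sketch rather than my exact command lines: the output file name is a placeholder and the kernel flags varied between runs (the -stackwalk capture shown is the one used for the CPU sampling later in this post):

rem start kernel tracing with sampled CPU profiling and call stacks
xperf -on base -stackwalk profile

rem ... run the insert workload ...

rem stop tracing and merge the events into an .etl file
xperf -d insert_test.etl

rem open the merged trace in Windows Performance Analyzer
wpa insert_test.etl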

This is my test procedure; it allows me to vary the number of inserts per commit, the relevance of which will be explained later in this post:

CREATE PROCEDURE [dbo].[usp_Insert] @BatchSize int AS
BEGIN
    SET NOCOUNT ON;

    DECLARE  @i INTEGER = 0  -- total rows inserted so far
            ,@j INTEGER = 0; -- rows inserted in the current transaction

    WHILE @i < 2000000
    BEGIN
        BEGIN TRANSACTION
            -- insert @BatchSize rows per commit
            WHILE @j < @BatchSize
            BEGIN
                INSERT INTO dbo.MyBigTable
                    DEFAULT VALUES;
                SET @j += 1;
            END;
        COMMIT;

        SET @i += @BatchSize;
        SET @j = 0;
    END;
END;
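For reference, the concurrent streams are fired off with ostress from the RML Utilities. The exact command line is not reproduced in this post, so treat the following as a sketch: the server name is a placeholder and the database is the one the log backup statement further down refers to. -n24 gives 24 concurrent connections, -r1 runs the query once per connection and -E uses Windows authentication:

ostress -SMYSERVER -E -dAdventureWorksDW2014 -n24 -r1 -q -Q"EXEC dbo.usp_Insert @BatchSize = 1;"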

Initially I will create two different versions of the MyBigTable table: one to test INSERT throughput with a sequential key:

CREATE TABLE dbo.MyBigTable (
     c1 UNIQUEIDENTIFIER NOT NULL ROWGUIDCOL
         CONSTRAINT DF_BigTable_c1 DEFAULT NEWSEQUENTIALID()
         CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED
    ,c2 DATETIME
         CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE()
    ,c3 CHAR (111)
         CONSTRAINT DF_BigTable_c3 DEFAULT 'a'
    ,c4 INT
         CONSTRAINT DF_BigTable_c4 DEFAULT 1
    ,c5 INT
         CONSTRAINT DF_BigTable_c5 DEFAULT 2
    ,c6 BIGINT
         CONSTRAINT DF_BigTable_c6 DEFAULT 42
);

and one to test throughput with a random key:

CREATE TABLE dbo.MyBigTable (
     c1 UNIQUEIDENTIFIER NOT NULL ROWGUIDCOL
         CONSTRAINT DF_BigTable_c1 DEFAULT NEWID()
         CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED
    ,c2 DATETIME
         CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE()
    ,c3 CHAR (111)
         CONSTRAINT DF_BigTable_c3 DEFAULT 'a'
    ,c4 INT
         CONSTRAINT DF_BigTable_c4 DEFAULT 1
    ,c5 INT
         CONSTRAINT DF_BigTable_c5 DEFAULT 2
    ,c6 BIGINT
         CONSTRAINT DF_BigTable_c6 DEFAULT 42
);

Through the work from my super scaling the SQL Server 2014 parallel insert blog post series, a manual striping scheme of seven files on the ioDrive to one file on each of the SanDisk solid state drives gives the best possible IO throughput. I use the following statement to seed my tables with test data (it creates 16 million rows), along with a couple of other statements to ‘reset’ the environment before each test:

WITH  t4    AS (SELECT n
                FROM   (VALUES(0),(0),(0),(0)) t(n))
     ,t256  AS (SELECT     0 AS n
                FROM       t4 AS a
                CROSS JOIN t4 AS b
                CROSS JOIN t4 AS c
                CROSS JOIN t4 AS d)
     ,t16M  AS (SELECT     ROW_NUMBER()
                               OVER (ORDER BY (a.n)) AS num
                FROM       t256 AS a
                CROSS JOIN t256 AS b
                CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH(TABLOCKX)
SELECT  NEWID()
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM   t16M;

CHECKPOINT
DBCC DROPCLEANBUFFERS

BACKUP LOG [AdventureWorksDW2014] TO DISK='NUL:'

DBCC SQLPERF('sys.dm_os_wait_stats'    , CLEAR)
DBCC SQLPERF('sys.dm_os_latch_stats'   , CLEAR)
DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR)
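The wait and spin lock figures shown after each run come from the DMVs that the statements above reset. This is roughly the sort of query I use to read them back; the column selection and ordering are my own choice rather than anything prescribed:

-- top waits since the last reset
SELECT TOP (10)
        wait_type
       ,waiting_tasks_count
       ,wait_time_ms
       ,signal_wait_time_ms
       ,wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM    sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- most active spin locks since the last reset
SELECT TOP (10)
        name
       ,collisions
       ,spins
       ,spins_per_collision
       ,sleep_time
       ,backoffs
FROM    sys.dm_os_spinlock_stats
ORDER BY spins DESC;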

The first test, with 24 threads and a cluster key populated by NEWSEQUENTIALID, completes in 1 hour and 13 minutes; CPU consumption is around 10% and IO throughput 10MB/s. The wait statistics for this test run are:

waits

The “Last page problem” throttles performance (image borrowed from Thomas Kejser with his permission):

last page

Now for the cluster key populated with NEWID and 24 threads; elapsed time for the run is 20 minutes and 55 seconds, CPU utilization is 18% and IO throughput 20MB/s:

waits

Takeaway #1: sequential keys kill concurrent insert performance when high performance IO subsystems are in use.

Repeating the test with trace flag 1118 enabled results in CPU consumption of 40%, IO throughput of 90MB/s and an elapsed time of 6 minutes and 45 seconds. Note the average wait time for the WRITELOG event:

waits

This is the spin lock activity for the run:

spins

Takeaway #2: for the best possible concurrent insert performance, use trace flag 1118.
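How the flag was enabled for this run is not shown above; it can either be added as a start-up parameter, as trace flag 1117 was, or switched on globally without a restart. A minimal sketch of the latter:

-- force uniform extent allocations instance-wide
-- (the start-up parameter equivalent would be -T1118 alongside -T1117)
DBCC TRACEON (1118, -1);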

Before going any further, let’s consider the impact of using a sequential key versus a random key in the form of a simple graph:

graph

In the spin lock statistics, XDESMGR is the most active spin lock; it serializes access to the part of the engine that serves out transaction ids. This is how it appears in the database engine call stack:

xdesmgr

At present we have one insert per transaction; what happens if we increase this to 2? Elapsed time drops to 4 minutes and 11 seconds, CPU utilization is 55% and IO throughput 130MB/s. These are our waits:

waits

and this is our spin lock activity; spins on the XDESMGR spin lock drop from 108,068,068,500 to 5,898,183,250, a significant saving!

spins

Takeaway #3: batching, no matter how small the batch, makes a performance difference.

If you recall my blog post on large memory pages and LOGCACHE_ACCESS spin lock pressure, removing the first physical core of NUMA node 0 from the CPU affinity mask can help reduce this pressure. This is how the LOGCACHE_ACCESS spin lock fits into the “bigger picture” of the logging mechanism:

Log Buffer structure

‘Affinitizing’ the rest of the database engine away from core 0 on NUMA node 0 removes the overhead of the log writer being context-switched out by other database engine threads when handing off the LOGCACHE_ACCESS spin lock and receiving it back:

LOGCACHE_ACCESS
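The post does not show the exact statement used to change the affinity mask. Assuming logical processors 0 and 1 map to the first physical core of NUMA node 0 (the enumeration is an assumption), the change and its reversal might look like this:

-- take the first physical core away from the database engine
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 2 TO 23;

-- restore the default affinity, using all logical processors
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;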

Let’s remove core 0 from the CPU affinity mask and run the test again: elapsed time is 4 minutes and 50 seconds, CPU consumption 45% and IO throughput 80MB/s:

waits

These are the spin lock stats: spins on LOGCACHE_ACCESS have been reduced from 11,123,850,476 to 3,576,595,589, roughly a threefold reduction, and sleep time on the spin lock has dropped as well. Unfortunately elapsed time has gone up; I attribute this in part to running 24 threads on 22 logical processors (with hyper-threading enabled, two logical processors equal one physical CPU core).

spins

Here is a recap of the fruits of our tuning efforts so far:

table

For each test using a random key, WRITELOG has always been the top wait event. Short of creating a RAM drive for the transaction log to reside on, or perhaps reformatting the ioDrive so that it is optimized for write performance, we will struggle to beat an average wait time of 0.1 ms for this wait event.

I tried using trace flag 610 in order to make my inserts minimally logged, and trace flags 8048 and 9024 together (FIX: High “log write waits” counter on a SQL Server 2012 instance). Trace flag 2330 made little difference to performance either. Perhaps what is required is something more radical. SQL Server 2014 comes with a feature called “delayed durability”; this allows log writes, or at least what we think are log file writes, to be queued up in a 60KB buffer for deferred hardening to the transaction log.
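Delayed durability can be set at the database level or requested per commit. The post does not show which option was used; a database-level FORCED setting is my assumption here, since it covers every transaction without changing the test code:

-- every commit in the database becomes a delayed durable (lazy) commit
ALTER DATABASE [AdventureWorksDW2014] SET DELAYED_DURABILITY = FORCED;

-- alternatively, with DELAYED_DURABILITY = ALLOWED at the database level,
-- individual transactions can opt in at commit time:
-- COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);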

Let’s give this a try with the same test as last time and the CPU affinity mask restored to its default; this gives us an IO throughput of 200MB/s and an elapsed time of 2 minutes and 25 seconds. What do our wait and spin lock statistics now look like?

waits

WRITELOG has disappeared off the radar; what about the spin lock statistics?

spins

This is the first time we have seen the number of spins on LOGCACHE_ACCESS down to a 9 digit number when using the non-sequential key and the default CPU affinity mask.

Takeaway #4: delayed durability makes a difference to concurrent insert performance even when log writes are in the sub-millisecond range.

NEWID() is peerless for distributing writes across a b-tree; however, at 16 bytes, UNIQUEIDENTIFIER is quite wide. Can we do better? Let’s try taking the SPID, multiplying it by 10,000,000,000 and using the result as a base value to add a sequence number to. The test table now looks like this:

CREATE TABLE dbo.MyBigTable (
     c1 bigint NOT NULL
         CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED
    ,c2 DATETIME
         CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE()
    ,c3 CHAR (111)
         CONSTRAINT DF_BigTable_c3 DEFAULT 'a'
    ,c4 INT
         CONSTRAINT DF_BigTable_c4 DEFAULT 1
    ,c5 INT
         CONSTRAINT DF_BigTable_c5 DEFAULT 2
    ,c6 BIGINT
         CONSTRAINT DF_BigTable_c6 DEFAULT 42
);

and the usp_Insert stored procedure to insert into it looks like:

CREATE PROCEDURE [dbo].[usp_Insert]
     @BatchSize    int = 1
    ,@RowsToInsert int = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE   @i    INTEGER = 0                      -- total rows inserted so far
             ,@j    INTEGER = 0                      -- rows inserted in the current transaction
             ,@k    BIGINT  = 0                      -- per-session sequence number
             ,@base BIGINT  = @@SPID * 10000000000;  -- keeps each session in its own key range

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            WHILE @j < @BatchSize
            BEGIN
                INSERT INTO dbo.MyBigTable (c1)
                VALUES (@base + @k);

                SET @j += 1;
                SET @k += 1;
            END;
        COMMIT;

        SET @i += @BatchSize;
        SET @j  = 0;
    END;
END;

So that our testing represents an “apples to apples” comparison, the T-SQL used to seed the MyBigTable table needs modifying:

WITH  t4    AS (SELECT n
                FROM   (VALUES(0),(0),(0),(0)) t(n))
     ,t256  AS (SELECT     0 AS n
                FROM       t4 AS a
                CROSS JOIN t4 AS b
                CROSS JOIN t4 AS c
                CROSS JOIN t4 AS d)
     ,t16M  AS (SELECT     ROW_NUMBER()
                               OVER (ORDER BY (a.n)) AS num
                FROM       t256 AS a
                CROSS JOIN t256 AS b
                CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH(TABLOCKX)
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM   t16M;

This test completes in an elapsed time of 1 minute and 53 seconds with a CPU utilization of 100% and an IO throughput of 235MB/s; our waits are:

waits

and our spin lock stats are:

spins

Takeaway #5: (@@SPID * 10000000000) + sequence number is a better performing key than NEWID().

I then tested whether eliminating OPT_IDX_STATS spins via trace flag 2330 would make a difference; as before, it made no material difference to performance.

Let’s try large memory pages. Why do large memory pages provide a performance boost? This is the size of the memory management unit page table without large memory pages:

Default TLB

With large memory pages, the page table in the on-die memory management unit can cover a larger area of memory, reducing the chances of having to make expensive trips out to main memory 🙂

TLB large pages
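How large pages were switched on is not spelled out in this post. For the buffer pool this usually means granting the SQL Server service account the ‘Lock pages in memory’ privilege and adding trace flag 834 as a start-up parameter; both of those details are assumptions about this particular setup. Whether large page allocations are actually in use can be checked afterwards:

-- a non-zero value indicates the process is using large page allocations
SELECT large_page_allocations_kb
FROM   sys.dm_os_process_memory;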

This results in the lowest elapsed time so far of 1 minute and 45 seconds, a CPU utilization of 100% and an IO throughput of 245MB/s. Here are the wait and spin lock stats for this test:

waits

spins

The Windows Performance Toolkit allows the cost saving of using large memory pages to be quantified; without large memory pages the sampled CPU time is 1,668,262 ms:

no large mem pages

with large memory pages the sampled CPU time is 1,622,681 ms:

large mem pages

this equates to a saving of 45,581 ms, roughly 2.7%, the best part of a logical CPU’s worth of processing time. The Westmere micro-architecture allows the memory management unit to cache 32MB worth of logical-to-physical memory mapping information, increased to 64MB with the Sandy Bridge micro-architecture, suggesting the saving would be greater on that architecture.

Takeaway #6: for little effort, large memory pages give a noticeable, if not significant, increase in performance.

During my testing I found that, even with 24 threads, CPU utilization across the 24 logical processors available to SQL Server was uneven and erratic at times; note the three logical processors I have highlighted below:

no_tf8008

The Windows Performance Toolkit backs this up; note the skew in sampled CPU time (ms) across the 24 logical processors, in particular the weightings of CPUs 15, 16 and 17 compared to 0, 7, 19 and 21:

no_tf8008

With trace flag 8008 in use, CPU utilization across all 24 logical processors is much more even. If you look at the Task Manager (followed by Windows Performance Analyzer) screenshot below, the load across all logical CPUs is even as ostress starts the 24 concurrent threads:

TF8008TF8008

Trace flag 8008 forces the database engine (note that I’m using the Enterprise Edition SKU) to always use the scheduler with the least load; refer to “How It Works: SQL Server 2012 Database Engine Task Scheduling” for more information.

Takeaway #7: trace flag 8008 can have a major impact on the distribution of the database engine’s CPU load across all available logical processors.

Note: although this did not result in better performance, it gave me consistent results without having to run each test 3 or 4 times over.

Where is the CPU time being expended in the different layers of the database engine during this last test?

sqllang_no_opt

Language processing alone accounts for 27% of the total CPU time, evidenced by the 465,355 ms on the sqllang.dll line in the table above. Processing T-SQL is expensive because it is an interpreted language, which is why stored procedures can be compiled to native code with the Hekaton in-memory OLTP engine.

Let’s repeat the test with a slightly modified version of the usp_Insert procedure which inserts two rows at a time using a single insert statement:

CREATE PROCEDURE [dbo].[usp_Insert]
     @BatchSize    int = 2   -- two rows are inserted per loop iteration
    ,@RowsToInsert int = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE  @i    INTEGER = 0                      -- total rows inserted so far
            ,@k    BIGINT  = 0                      -- per-session sequence number
            ,@base BIGINT  = @@SPID * 10000000000;  -- keeps each session in its own key range

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            -- two rows per insert statement, one statement per commit
            INSERT INTO dbo.MyBigTable (c1)
            VALUES  (@base + @k)
                   ,(@base + @k + 1);
        COMMIT;

        SET @k += 2;
        SET @i += @BatchSize;
    END;
END;

This test runs in 1 minute and 25 seconds, CPU utilization is 105% and IO throughput 300MB/s with the following wait and spin lock statistics:

waitsspins

sqllang_otp

The multi-row insert has lowered the overall CPU cost of T-SQL language processing (time burned by sqllang.dll) from 27% to 22%!

Takeaway #8: the overhead of processing the T-SQL language under load is significant.

This last test results in a CPU utilization of 105%; the CPUs are pushed so hard that turbo boost of the CPU clock frequency kicks in, which is why the 100% CPU utilization ceiling can be breached. Where are our CPU cycles going? Running the test again and collecting a stack trace with xperf -on base -stackwalk profile should provide the answer. The total weight for this run is 1,436,663 (sampled CPU time in ms). Nothing sprang out of the trace in terms of opportunities to further reduce CPU consumption; however, I will point out a handful of interesting excerpts from the trace:

124,414 ms is expended on the seek required to locate the position for the insert

seek

113,944 ms is expended obtaining the transaction id

xdesmgr

94,253 ms on the insert of the record into the page

insert

51,976 ms on page splits

split

245,338 ms is spent on the commit

commit

60,136 ms is expended by the log writer

logwriter

One last thing: I wanted to test the effect of turning hyper-threading off, which results in the database engine seeing 12 rather than 24 logical processors. Most people should now be using servers containing Intel Xeon “Core i” series architecture CPUs; this architecture introduced a design for hyper-threading in which the free CPU cycles created by a CPU stall (such as a last level cache miss) are used to schedule a second thread on the physical core:

hyperthreading

Prior to this, hyper-threading was based on the “NetBurst” architecture, which used free pipeline slots to schedule instructions for the second (hyper) thread. The fact that the fourth generation of the Pentium range only had 5 back-end processing units (the context switching this caused probably didn’t help either) resulted in disappointing performance. For this test, execution time is 1 minute and 6 seconds, CPU utilization 105% and IO throughput 230MB/s, and this is with 12 threads! Considering the same test runs in 1 minute and 25 seconds with 24 threads and hyper-threading enabled, this provides the last takeaway point of this post:

Takeaway #9: for OLTP workloads, use hyper-threading when Xeon “Core i” architecture processors are in use.

To close this blog post (thank you if you have stuck with it all the way to here! 😉), here are all the test results consolidated into a single table (“Page Pages” should read “Large Memory Pages”):

final table

What about inserts into partitioned tables? I did test this and will blog about it at some later date. People will also want to know how the GUID key compares with the @@SPID-derived key in the best case scenario; the run times of these tests appear on lines 9 and 10 of the table above, but they are probably best viewed in this simple graph:

spid versus guid

I hope you have enjoyed reading this blog post.

7 thoughts on “Super Scaling Singleton Inserts”

  1. In the last results table, using 12 threads with hyper-threading off (which gives SQL Server 12 logical processors) results in an elapsed time of 66 seconds; on the 3rd line from the bottom of the same table, the exact same test with HT on and 24 threads runs in 85 seconds. With HT on and 24 threads, 24 x 2,000,000 rows are inserted in 85 seconds; with HT off and 12 threads, half the number of rows are inserted in 66 seconds, hence takeaway #9. I could and should have made this point clearer.

  2. Another thing to test would be to singleton-insert into a Hekaton table and periodically (every 100ms?!) move the rows over to a regular table. That way most load and like 99.999% of the contention is borne by Hekaton.

    • Yes, that’s on the list to do; I have not looked at Hekaton yet. The test would have to be subtly different in that it does not support unique indexes (yet).

    • Mark,

      I now have a 20 core Xeon E5 v3 machine at my disposal; with hyper-threading enabled this presents 40 logical processors to SQL Server. I’m very shortly going to see where the outer edges of performance with Hekaton in SQL Server 2016 CTP 2.2 are, watch this space 😉

      Chris
