This blog post is one of my longer ones, if not the longest, however if you read it in its entirety, you will find out how to get the elapsed time of 24 concurrent singleton inserts down from over an hour to 1 minute and 45 seconds ( 1 minute and 25 seconds, if you are flexible about the term ‘Singleton’ 😉 ).
This is my hardware test setup:
- Dell T5500 with two hexa-core 2.4Ghz Xeons ( Westmere micro-architecture ) with hyper threading enabled.
- 32 Gb of DDR3 1600Mhz triple channel memory
- One 2.4Tb ioDrive 2 Duo.
- Two 480Gb SanDisk Extreme Pro solid state drives.
- RML Utilities 64 bit, I’m going to use this to fire of concurrent inserts at the database
- Windows Server 2012 R2
- SQL Server 2014 Enterprise Edition with CU 5
The server hosts a single instance configured as follows:
- min and max memory set to 28GB
- Trace flags 1117 specified as database engine start-up parameters
I should caveat the results in this blog with two important points:
- If you have a spinning disk based IO sub system, the results you are likely to see are likely to be radically different to those in this blog post.
- The IO throughput figures I quote are for log writer activity and not when checkpoints take place.
For the uninitiated, I use windows performance toolkit a lot in my work: xperf to create event tracing for windows files and windows performance view (WPA) to visualize the data these files contain, this mind map illustrates this process:
This is my test procedure, it allows me to vary the number of inserts per commit, the relevance of which will be explained later in this post:
CREATE PROCEDURE [dbo].[usp_Insert] @BatchSize int AS BEGIN SET NOCOUNT ON; DECLARE @i INTEGER = 0 ,@j INTEGER = 0; WHILE @i < 2000000 BEGIN BEGIN TRANSACTION WHILE @j < @BatchSize BEGIN INSERT INTO dbo.MyBigTable DEFAULT VALUES; SET @j += 1; END; COMMIT; SET @i += @BatchSize; SET @j = 0; END; END;
Initially I will create two different versions of the MyBigData table one to test INSERT throughput with a sequential key:
CREATE TABLE dbo.MyBigTable ( c1 UNIQUEIDENTIFIER NOT NULL ROWGUIDCOL CONSTRAINT DF_BigTable_c1 DEFAULT NEWSEQUENTUALID() CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED ,c2 DATETIME CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE() ,c3 CHAR (111) CONSTRAINT DF_BigTable_c3 DEFAULT 'a' ,c4 INT CONSTRAINT DF_BigTable_c4 DEFAULT 1 ,c5 INT CONSTRAINT DF_BigTable_c5 DEFAULT 2 ,c6 BIGINT CONSTRAINT DF_BigTable_c6 DEFAULT 42 );
and one to test throughput with a random key:
CREATE TABLE dbo.MyBigTable ( c1 UNIQUEIDENTIFIER NOT NULL ROWGUIDCOL CONSTRAINT DF_BigTable_c1 DEFAULT NEWID() CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED ,c2 DATETIME CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE() ,c3 CHAR (111) CONSTRAINT DF_BigTable_c3 DEFAULT 'a' ,c4 INT CONSTRAINT DF_BigTable_c4 DEFAULT 1 ,c5 INT CONSTRAINT DF_BigTable_c5 DEFAULT 2 ,c6 BIGINT CONSTRAINT DF_BigTable_c6 DEFAULT 42 );
Through the work from my super scaling the SQL Server 2014 parallel insert blog post series, a manual striping scheme of seven files on the ioDrive to one file on either of the SanDisk solid state drives gives the best possible IO throughput. I use the following statement to seed my tables with test data, it creates 16 million rows along with a couple of other statements to ‘Reset’ the environment for each test:
WITH t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n)) ,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d) ,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c) INSERT INTO dbo.MyBigTable WITH(TABLOCKX) SELECT NEWID() ,GETDATE() ,'a' ,1 ,2 ,42 FROM t16M; CHECKPOINT DBCC DROPCLEANBUFFERS BACKUP LOG [AdventureWorksDW2014] TO DISK='NUL:' DBCC SQLPERF('sys.dm_os_wait_stats' , CLEAR) DBCC SQLPERF('sys.dm_os_latch_stats' , CLEAR) DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR)
The first test with 24 threads and a cluster key populated by NEWSEQUENTIALID completes in 1 hour and 13 minutes, CPU consumption is around 10% and IO throughput 10MB/s, the wait statistics for this test run are:
The “Last page problem” throttles performance (image borrowed from Thomas Kejser with his permission):
Now for the cluster key populated with NEWID and 24 threads; elapsed time for the run is 20 minutes and 55 seconds, CPU utilization is 18% and IO throughput 20MB/s:
Takeaway #1 sequential keys kill concurrent insert performance when high performance IO sub systems are in use.
Repeating the test with trace flag 1118 enabled results in a CPU consumption of 40%, IO throughput of 90MB/s and an elapsed time 6 minutes and 45 seconds. Note the average wait time for the WRITELOG event:
This is the spin lock activity for the run:
Takeaway #2 for the best possible concurrent insert performance use trace flag 1118.
Before going any further, lets just consider the impact of using a sequential key versus a random key in the form of a simple graph:
In the spin lock statistics, XDESMGR is the most active spin lock, this serializes access to the part of the engine which serves out transaction ids, this is how it appears in the database engine call stack:
At present we have one insert per transaction, what happens if we increase this to 2 ?. Elapsed time drops to 4 minutes and 11 seconds, CPU utilization is 55% and IO throughput 130MB/s. These are our waits:
and this is our spin lock activity, spins on the XDESMGR spin lock drop from 108,068,068,500 to 5,898,183,250, a significant saving !.
Takeaway #3 batching, no matter how small the batch, makes a performance difference.
If you recall my blog post on large memory pages and LOGCACHE_ACCESS spin lock pressure, removing the first physical core from the CPU affinity mask for NUMA node 0 can help reduce this. This is how the LOGCACHE_ACCESS spin lock fits into the “Bigger picture” of the logging mechanism:
By ‘Affinitizing’ the rest of the database engine away from core 0 on NUMA node 0, removes the overhead of the log writer being context switched out by other database engine threads when handing off the LOGCACHE_ACCESS spin lock and receiving it back:
Lets remove core 0 from the CPU affinity mask and run the test again, elapsed time is 4 minutes and 50 seconds, CPU consumption 45 % and IO throughput 80 MB/s:
These are the spin locks stats, spins on LOGCACHE_ACCESS have been reduced from 11,123,850,476 to 3,576,595,589; an order of magnitude reduction !!!, sleep time on the spin lock has dropped also. Unfortunately elapsed time has gone up, I attribute this in part on running 24 threads on 22 logical processors, with hyper threading enabled two logical processors = 1 physical CPU core.
Here is a recap of the fruits of our tuning efforts so far:
For each test using a random key WRITELOG has always been the top wait event. Short of creating a RAM drive for the transaction log to reside on or perhaps re formatting the ioDrive such that its optimized for write performance, we will struggle to beat an average wait time of 0.1 ms for this wait event.
I tried using trace flag 610 in order to make my inserts minimally logged and trace flags 8048 and 9024 together ( FIX: High “log wait writes” counter on a SQL 2012 instance” ). Trace flag 2330 made little difference to performance either. Perhaps what is required is something more radical. SQL Server 2014 comes with something called “Delayed durability”, this allow log writes, or at least what we think are log file writes, to be queued up in a 60KB buffer for deferred hardening to the transaction log.
Lets give this a try with the same test as last time with the CPU affinity mask restored to its default, this gives us an IO throughput of 200MB/s and an elapsed time of 2 minutes and 25 seconds. What do our wait and spin lock statistics now look like ?:
WRITELOG has disappeared off the radar, what about spin lock statistics ?:
This is the first time we have seen the number of spins on LOGCACHE_ACCESS down to a 9 digit number when using the non-sequential key and the default CPU affinity mask.
Takeaway #4 delayed durability makes a difference to concurrent insert performance even when log writes are in the sub milli-second range.
NEWID() is peerless for distributing writes across a b-tree, however, at 16 bytes UNIQUEIDENTIFER is quite wide, can we do better ?. Lets try taking the SPID multiplying it by 10,000,000 to use as a base value for adding a sequence to. The test table now looks like this:
CREATE TABLE dbo.MyBigTable ( c1 bigint NOT NULL CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED ,c2 DATETIME CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE() ,c3 CHAR (111) CONSTRAINT DF_BigTable_c3 DEFAULT 'a' ,c4 INT CONSTRAINT DF_BigTable_c4 DEFAULT 1 ,c5 INT CONSTRAINT DF_BigTable_c5 DEFAULT 2 ,c6 BIGINT CONSTRAINT DF_BigTable_c6 DEFAULT 42 );
and the usp_Insert stored procedure to insert into it looks like:
CREATE PROCEDURE [dbo].[usp_Insert] @BatchSize int = 1 ,@RowsToInsert int = 2000000 AS BEGIN SET NOCOUNT ON; DECLARE @i INTEGER ,@j INTEGER = 0 ,@k BIGINT = 0 ,@base BIGINT = @@SPID * 10000000000; WHILE @i < @RowsToInsert BEGIN BEGIN TRANSACTION WHILE @j < @BatchSize BEGIN INSERT INTO dbo.MyBigTable VALUES (@base + @k); SET @j += 1; SET @k += 1; END; COMMIT; SET @i += @BatchSize; SET @j = 0; END; END;
So that our testing represents an “Apples to apples” comparison, the T-SQL used to seed the MyBigData table will need modifying:
WITH t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n)) ,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d) ,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c) INSERT INTO dbo.MyBigTable WITH(TABLOCKX) SELECT CAST(CRYPT_GEN_RANDOM(8) AS BIGINT) ,GETDATE() ,'a' ,1 ,2 ,42 FROM t16M;
This test completes in an elapsed time of 1 minute 53 seconds with a CPU utilization of 100% and an IO throughput of 235MB/, our waits are:
and our spin lock stats are:
Takeaway #5 (@@SPID * 10000000000) + sequence number is a superior performing key than NEWID().
I then tested what the elimination OPT_IDX_STATS spins through trace flag 2330 would result in, however, as before, this made no material difference to performance.
Let’s try large memory pages. Why do large memory pages provide a performance boost ?. This is the size of the memory management unit page table without large memory pages:
With large memory pages, the page table in the on CPU die memory management unit can cover a larger area of memory, thus reducing the chances of having to make expensive trips out to main memory 🙂
This results in the lowest elapsed time so far of 1 minutes and 45 seconds, a CPU utilization of 100% and an IO throughput of 245MB/s. Here are the wait and spin lock stats for this test:
Windows performance toolkit allows the cost saving of using large memory pages to be quantified, without large memory pages the sampled CPU time is 1,668,262ms
with large memory pages the sampled CPU time is 1,622,681 ms:
this equates to a saving of 2.7%, the best part of a logical CPU’s worth of processing time. The Westmere micro architecture allows the memory management unit to cache 32MB of logical to physical memory mapping information, increased to 64MB with the Sandybridge micro architecture, suggesting the saving would be greater with this.
Takeaway #6 for little effort, large memory pages give a noticeable if not significant increase in performance.
During my testing I found that even with 24 threads CPU utilization across the 24 logical processors available to SQL Server was very random and uneven at times, note the three logical processors I have highlighted below:
Windows performance toolkit backs this up, note the skew in sampled CPU time (ms) across the 24 logical processors, in particular note the weighting’s of CPUs: 15, 16 and 17 compared to 0, 7, 19 and 21:
With trace flag 8008 in use CPU utilization across all 24 logical processors is much more even. If you look at the task manager (followed by windows performance analyzer) screen shot below ; the load across all logical CPU’s is even as the ostress starts the 24 concurrent threads:
Trace flag 8008 forces the database engine (note that I’m using the enterprise edition SKU) to always use the scheduler with the least load, refer to: “How it works: SQL Server 2012 Database Engine Task Scheduling” for more information on this.
Takeaway #7 trace flag 8008 can have a major impact on the distribution of the database engine CPU load across all available logical processors.
Note: although this did not result in better performance, I obtained consistent results without having to run each test 3 or 4 times over.
Where is the CPU time being expended in the different layers of the databases engine during this last test ?:
Language processing alone expends 27% of the total CPU time; evident by the 465,355 ms on the same line as sqllang.dll in the table above. Processing T-SQL is expensive because it is an interpreted language, hence why stored procedures can be compiled to native code with the Hekaton in memory OLTP engine.
Lets repeat the test with a slightly modified version of the usp_Insert procedure which can insert two rows at a time using a single insert statement:
CREATE PROCEDURE [dbo].[usp_Insert] @BatchSize int = 1 ,@RowsToInsert int = 2000000 AS BEGIN SET NOCOUNT ON; DECLARE @i INTEGER = 0 ,@j INTEGER = 0 ,@base BIGINT = @@SPID * 10000000000; WHILE @i < @RowsToInsert BEGIN BEGIN TRANSACTION INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @k) ,(@base + @k + 1); COMMIT; SET @k += 2; SET @i += @BatchSize; END; END;
This test runs in 1 minute and 25 seconds, CPU utilization is 105% and IO throughput 300MB/s with the following wait and spin lock statistics:
The multi-row insert has lowered the overall CPU cost of T-SQL language processing (time burned by sqllang.dll) from 27% to 22% !.
Takeaway #8 the overhead of processing the T-SQL language under load is significant.
This last test results in a CPU utilization of 105%, the CPUs are pushed that hard that turbo boost for the CPU clock frequency is invoked, hence this why the 100% CPU utilization ceiling can be breached. Where are our CPU cycles going ?, running the test again and collecting a stack trace with xperf -on base -stackwalk profile should provide the answer. The total weight for this run is 1,436,663 ( sampled CPU time in ms) of this. Nothing sprang out of the trace in terms of opportunities to further CPU consumption reductions, however I will point out a handful of interesting excerpts from the trace:
124,414 ms is expended on the seek required to locate the position for the insert
113,944 ms is expended obtaining the transaction id
94,253 ms on the insert of the record into the page
51,976 ms on page splits
245,338 ms is spent on the commit
60,136 ms is expended by the log writer
One last thing I wanted to test the effect of turning hyper threading off, this will result in the database engine seeing 12 and not 24 logical processors. Most people should now be using servers containing Intel Xeon “Core i” series architecture CPUs, this introduced a design for hyper threading in which the free CPU cycles created by a CPU stall ( or last level cache misses ) are used to schedule a second thread on the physical core:
Prior to this hyper threading was based on the “Net Burst” architecture, this used free pipeline slots to schedule instructions for the second (hyper) thread. The fact that the forth generation of the Pentium range only had 5 back-end processing units ( the context switching this caused probably didn’t help either ) resulted in disappointing performance. For the test, execution time is 1 minutes and 6 seconds, CPU utilization 105% and IO throughput 230MB/s, this is with 12 threads ! . Considering the same test runs in 1 minute and 25 seconds with 24 threads and hyper threading enabled, this provides the last takeaway point of this post:
Takeaway #9 for OLTP workloads use hyper-threading when Xeon “Core i” architecture processors are in use.
To close this blog post ( I thank you if you have stuck with it all the way to here ! ;-), here are all the test results consolidated into a single table ( “Page Pages” should read “Large Memory Pages”):
What about inserts into partitioning tables ?, I did test this and I will blog about this at some later date, also people will want to know how the GUID versus the @@SPID derived key compared under the best case scenario, the run times of these tests appear on lines 9 and 10 in the table above, however their times are probably best viewed in this simple graph:
I hope you have enjoyed reading this blog post.