The last blog post on spinlocks focused on the LOGCACHE_ACCESS spinlock and how pressure could be reduced on it by tweaking the CPU affinity mask. However, as is invariably the case with tuning you remove one bottleneck only for another to appear elsewhere, much like whack-a-mole.
To recap, this is the T-SQL I use to create the table to insert into, seed it with data followed by the stored procedure used for inserting into it, I’ve also included some statements for clearing down the transaction log and some of the performance related dynamic management views:
CREATE TABLE [dbo].[MyBigTable] ( [c1] [bigint] NOT NULL ,[c2] [datetime] NULL ,[c3] [char](111) NULL ,[c4] [int] NULL ,[c5] [int] NULL ,[c6] [bigint] NULL CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ( [c1] ASC ) WITH ( PAD_INDEX = OFF ,STATISTICS_NORECOMPUTE = OFF ,IGNORE_DUP_KEY = OFF ,ALLOW_ROW_LOCKS = ON ,ALLOW_PAGE_LOCKS = ON) ON [FG_01] ) ON [FG_01] GO TRUNCATE TABLE dbo.MyBigTable; ;WITH t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n)) ,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d) ,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c) INSERT INTO dbo.MyBigTable WITH TABLOCKX SELECT CAST(CRYPT_GEN_RANDOM(8) AS BIGINT) ,GETDATE() ,'a' ,1 ,2 ,42 FROM t16M; BACKUP LOG SingletonInsert TO DISK = 'NUL:' DBCC SQLPERF("sys.dm_os_wait_stats" , clear) DBCC SQLPERF("sys.dm_os_spinlock_stats", clear) DBCC SQLPERF("sys.dm_os_latch_stats" , clear) CREATE PROCEDURE [dbo].[usp_Insert] @BatchSize int = 1 ,@RowsToInsert int = 2000000 AS BEGIN SET NOCOUNT ON; DECLARE @i INTEGER = 0 ,@base BIGINT = @@SPID * 10000000000; WHILE @i < @RowsToInsert BEGIN BEGIN TRANSACTION INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i); COMMIT; SET @i += 1; END; END;
This is the test environment:
Trace flags 1117 and 8008 are enabled and the file group containing the table used for the test has one data file per logical processor; 40 in total. ostress from rml utilities (downloadable from here) is used to fire off the usp_insert stored procedure thus:
ostress -E -dimoltp -Q"exec usp_insertImolpt" -n<num threads>
Relieving the pressure on the LOGCACHE_ACCESS spinlock places more pressure on XDESMGR:
XDESMGR serializes access to the part of the database engine used to create and destroy transaction ids:
There are several solutions for dealing with this spinlock:
- Use processors that provide superior single threaded performance so that the spinlock can be acquired and released faster.
- Batch multiple DML statements up per transaction.
Increasing the batch size by just one results in a big jump in insert performance:
There are several places to go after extinguishing all tuning options in terms of low hanging fruit with the disk based engine:
Tuning 101: When The Bottleneck Is Contention On A Single Resource, Create More Of Them !
By their very nature both the LOGCACHE_ACCESS and XDESMGR spinlocks are singleton resources, however you can ‘Shard’ a database across instances and ‘Affinitize’ each instance to a NUMA node / CPU socket:
As per the bullet points above, this helps to kill two birds with one stone: you are no longer constrained to one LOGCACHE_ACCESS and XDESMGR spinlock, the cache lines associated with these spinlocks only have to burn 30 or so CPU cycles travelling across the level 3 cache instead of the quick path interconnect which links sockets. All cache lines inside the CPU cache are tagged with the main memory address they originate from.
‘Affinitizing’ an instance to a CPU socket is tricky, in that there is no way of explicitly binding the log writer to a NUMA node and because windows server 2012 R2 is used for these tests, windows system resource manager is not an option ( without using a hack ). The one remaining option is to create two virtual machines and tie each one to its own CPU socket, The CPU affinity mask makes performing the insert test on a single socket trivial and can indicate the results that the database sharding approach might yield, these are the results this gives us:
The test achieves a peak throughput of around 475,000 inserts per second for 18 threads, with both sockets the peak throughput was 500,000 inserts per second. Because the operating system loads one of the NUMA nodes up with threads, the results from the graph above are unlikely to translate to a 100% improvement in insert throughput if one instance per socket could be used, however if this is somewhere between 50 and 100% it’s still impressive.
Ulrich Drepper has produced some seminal white papers on memory, excerpts from these papers shed light on why spinlocks may inhibit database engine scalability on multi socket servers:
The CPU cache paper infers that a performance penalty is paid when passing spinlock cache lines around:
The top three most spin intensive spinlocks when using the disk based database engine and a CPU core dedicated to the log writer are:
Lets perform the insert test with an in memory table:
CREATE TABLE [dbo].[MyBigTable] ( [c1] [bigint] NOT NULL ,[c2] [datetime] NULL ,[c3] [char](111) COLLATE Latin1_General_CI_AS NULL ,[c4] [int] NULL ,[c5] [int] NULL ,[c6] [bigint] NULL CONSTRAINT [PK_MyBigTable] PRIMARY KEY NONCLUSTERED HASH ( [c1] ) WITH ( BUCKET_COUNT = 16777216) ) WITH ( MEMORY_OPTIMIZED = ON ,DURABILITY = SCHEMA_AND_DATA )
With hyper threading disabled and inserts into an in memory table and a regular stored procedure, we see a new entry into the top three spin intensive spinlocks:
There was almost no BLOCKER_ENUM spinlock activity when using a disk based table, but now there is, what does it do ?, its name reveals little, so lets dig into the call stack:
I suspect that the BLOCKER_ENUM might be a ‘Generic’ spinlock used in multiple scenarios, in this specific one it synchronizes transactions between the disk and in memory OLTP engines. Running the test again but with natively compiled stored procedure:
CREATE PROCEDURE [dbo].[usp_insertImoltp] @BatchSize int = 2 ,@RowsToInsert int = 2000000 ,@Spid bigint WITH NATIVE_COMPILATION, SCHEMABINDING AS BEGIN ATOMIC WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT ,LANGUAGE = N'English') DECLARE @j INTEGER = 0 ,@base BIGINT = @Spid * 10000000000; WHILE @j < @RowsToInsert BEGIN INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @j); SET @j += 1; END; END;
The in memory OLTP database engine does not support @@SPID, this requires that @@SPID being passed into the stored procedure on the ostress command line:
ostress -E -dimoltp -Q"exec usp_insertImolpt @Spid=@@SPID" -n<num threads>
. . . and et voilà !, the BLOCKER_ENUM spinlock has dropped out of the top ten spin intensive spinlocks, let alone the top three:
- The XDESMGR spinlock serializes access to the part of the disk based database engine that issues and destroys transaction ids.
- Reduce pressure on the XDESMGR spinlock by:
- Batch DML statements up per transaction.
- Use CPUs with superior single threaded performance
- Shard the database containing the table being inserted into across multiple instances
- Use the in memory OLTP engine as this does not use the XDESMGR spinlock at all,
- The mechanism that enables transactions to span both the disk and in-memory OLTP database engines uses the BLOCKER_ENUM spinlock.