To date I have theorized about the behavior of certain types of spinlock; now that I have access to hardware that can put them under serious pressure, I can quantify much of what I have blogged about.
“A Tale Of Two Spinlocks”
The focus of this blog post will be on two spinlocks:

- LOGCACHE_ACCESS: a task's thread has to acquire this before it can copy logging data from its own private memory area into the log buffer.

- XDESMGR: this serializes access to the part of the database engine that serves out transaction ids.
What Do You Class As A “Big Box” ?
I class a big box as one with enough physical cores to generate pressure on unique points of serialization within the database engine. There is one particular point of serialization I have in mind: the LOGCACHE_ACCESS spinlock. Earlier tests suggest that a server with two CPU sockets and more than six cores per socket is required to stress this spin-lock.
My “Big Box”
- ASUS Z10-D16 WS motherboard
- 2 x Xeon E5 2650 (10 core, 2.3 GHz) with hyper-threading enabled
- 64GB quad channel DDR4
- Windows Server 2012 R2
- SQL Server 2016 CTP 2.2
- Baseline configuration trace flags: 1117, 1118, 2032 and 8008.
- SQL Server max and min memory set to 57344MB
- Lock pages in memory system privilege granted to database engine service account
- Storage: RAID 0 64K allocation size volume composed of 3 x SanDisk Extreme Pro solid state drives (480GB) and 1 x FusionIO (Gen 1) 320GB ioDrive.
- Delayed Durability set to FORCED on the test database
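For reference, the database-level part of this configuration can be applied with T-SQL along the following lines. This is a sketch: [MyTestDB] is a stand-in for whatever the test database is actually called, and the trace flags themselves are normally supplied as -T startup parameters rather than set here.

```sql
-- Delayed Durability set to FORCED on the test database
-- ([MyTestDB] is a placeholder name)
ALTER DATABASE [MyTestDB] SET DELAYED_DURABILITY = FORCED;

-- List all globally enabled trace flags to confirm the
-- startup flags actually took effect
DBCC TRACESTATUS(-1);
```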
In an earlier blog post I found that a clustered index key crafted around @@SPID is highly scalable. Below is a table with a clustered index, a SQL statement to 'seed' it with data and a stored procedure to perform inserts when invoked by ostress from the RML utilities:
```sql
CREATE TABLE [dbo].[MyBigTable] (
    [c1] [bigint]    NOT NULL,
    [c2] [datetime]  NULL,
    [c3] [char](111) NULL,
    [c4] [int]       NULL,
    [c5] [int]       NULL,
    [c6] [bigint]    NULL,
    CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ( [c1] ASC )
        WITH ( PAD_INDEX              = OFF
              ,STATISTICS_NORECOMPUTE = OFF
              ,IGNORE_DUP_KEY         = OFF
              ,ALLOW_ROW_LOCKS        = ON
              ,ALLOW_PAGE_LOCKS       = ON ) ON [FG_01]
) ON [FG_01]
GO

TRUNCATE TABLE dbo.MyBigTable;

;WITH t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
,t256 AS (SELECT 0 AS n
          FROM   t4 AS a CROSS JOIN t4 AS b
                 CROSS JOIN t4 AS c CROSS JOIN t4 AS d)
,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num
          FROM   t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH (TABLOCKX)   -- table hint requires parentheses
SELECT CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
      ,GETDATE()
      ,'a'
      ,1
      ,2
      ,42
FROM t16M;

BACKUP LOG [MyTestDB] TO DISK = 'NUL:'   -- substitute the test database name
DBCC SQLPERF('sys.dm_os_wait_stats'    , CLEAR)
DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR)
DBCC SQLPERF('sys.dm_os_latch_stats'   , CLEAR)
GO

CREATE PROCEDURE [dbo].[usp_Insert]
     @BatchSize    int = 1
    ,@RowsToInsert int = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @i    INTEGER = 0
           ,@base BIGINT  = @@SPID * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i);
        COMMIT;
        SET @i += 1;
    END;
END;
```
This performs 186,915 inserts per second across 40 threads with the following wait and spin-lock statistics:
Can We Do Anything To Reduce LOGCACHE_ACCESS Pressure ?
Let's re-run the test with the first core of NUMA node 1 removed from the CPU affinity mask:
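The same change can be scripted rather than made through the SSMS GUI. The sketch below assumes a 40-logical-processor box where the log writer's physical core maps to logical processors 0 and 1 (its two hyper-threads); substitute the pair reported by the scheduler query shown later in this post.

```sql
-- Remove the log writer's physical core (both of its
-- hyper-threads) from the process affinity mask,
-- leaving logical CPUs 2 to 39 for everything else.
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 2 TO 39;
```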
This run produces 436,781 inserts per second for 38 threads with the following wait and spinlock activity:
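The spin-lock numbers behind these captures come straight out of sys.dm_os_spinlock_stats; a quick way to pull out just the two spin-locks this post cares about:

```sql
-- Spin activity for the two spinlocks of interest,
-- accumulated since the last DBCC SQLPERF(..., CLEAR)
SELECT name
      ,collisions
      ,spins
      ,spins_per_collision
      ,backoffs
FROM   sys.dm_os_spinlock_stats
WHERE  name IN ('LOGCACHE_ACCESS', 'XDESMGR')
ORDER  BY spins DESC;
```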
The transaction log IO throughput is greater when the log writer CPU core has been removed from the CPU affinity mask. This is the log throughput for the default CPU affinity mask:
and this is the throughput with the log writer CPU core removed from the affinity mask:
The implication is that the excessive spin activity on the LOGCACHE_ACCESS spinlock is throttling log writer flush IO throughput when the default CPU affinity mask is in use.
Where Is Our CPU Time Going ?
I’ve highlighted LogAppendContext::BackOff; with the default CPU affinity mask this accounts for 23.89% of the total sampled CPU time, and after removing core 0 from the CPU affinity mask it plummets to 1.43%!
Based on its name alone, one would hope that LogAppendContext::BackOff has something to do with logging-related spin-lock activity. What does the database engine call stack have to say about this ?
We can see the memcpy function used to copy a session's logging information into the log buffer, and above it a call into sqlmin.dll!SQLServerLogMgr preceding our call to sqlmin.dll!LogAppendContext::BackOff. This suggests we are in the right area for log buffer access serialization.
Having established that removing the core the log writer runs on from the CPU affinity mask has a major impact on the insert rate, what do the various spin-lock and insert rate metrics look like ? In the graphs below 'Baseline' refers to the tests with the CPU affinity mask in its default state and 'Optimized CPU affinity mask' refers to the CPU affinity mask without the log writer CPU core present.
- Baseline test insert rate and LOGCACHE_ACCESS spins / thread count
- Insert rate and LOGCACHE_ACCESS spins / thread count with the 'Optimized' CPU affinity mask
- Insert rate comparison
- XDESMGR spin comparison
‘Affinitizing’ the rest of the database engine away from the log writer CPU core increases DML throughput significantly as summarized in this graph:
Referring to the LOGCACHE_ACCESS spin comparison graph, if we take away the outliers, spin activity for this spin-lock is relatively static across both tests. The key thing to note is that the modified CPU affinity mask achieves a superior insert rate despite the LOGCACHE_ACCESS spin rate remaining almost the same.
Improving the efficiency with which the LOGCACHE_ACCESS spin-lock is acquired and released increases pressure on the XDESMGR spin-lock.
How Has This Worked Exactly ?
Understanding how this optimization works requires that we understand a bit about memory: the unit of transfer between main memory and the CPU is the cache line. Once inside the CPU cache hierarchy, cache lines become cache entries and are tagged. The LOGCACHE_ACCESS spin-lock has its own cache entry. Any session's thread wishing to copy logging information into the log buffer must first acquire the cache entry associated with the LOGCACHE_ACCESS spinlock:
The speed at which the LOGCACHE_ACCESS spinlock can be acquired and released is determined by how fast its cache entry can be handed off to the CPU core running the task thread attempting to copy logging information into the log buffer, and returned. Allowing other database engine threads to run on the same core as the log writer introduces the possibility of the log writer thread being context switched in before this hand-off can take place. 'Affinitizing' the rest of the database engine away from the log writer's home CPU core expedites the hand-off by minimizing the number of context switches that can interrupt it.
The amount of spin activity a spinlock undergoes is also influenced by how far the spinlock’s cache entry has to travel:
The fact that a context switch is the great white shark of performance killers cannot be overstated; a good ballpark figure for main memory access is 167 CPU cycles, plus any CPU cycles required by virtual-to-physical memory translation. To put the cost of a context switch into context (pun not intended), take note of this excerpt from the OSDev wiki:
The overhead of storing / restoring registers (not counting any TLB overhead / excluding cost of FPU register store / restore) is 188 cycles on the P4 (67 ns), 79 cycles on the P2 (395 ns).
A context switch also includes the overhead of switching address spaces (if we’re switching between processes, not threads). The minimal cost of switching between two address spaces (counting a minimal TLB reload of 1 code page, 1 data page, and 1 stack page) is 516 cycles on a P4 (184 ns) and 177 cycles on a P3 (885 ns).
Excluding the cost of a process or thread transitioning from user mode to kernel mode, a context switch costs upwards of twice the number of CPU cycles required by main memory access.
I should add that because the log writer runs on top of a scheduler, the "context switch" the log writer undergoes involves yielding and being re-scheduled via the runnable queue. Despite being cheaper than a full-blown context switch, this is:

a) Not free.

b) Disruptive to the CPU cache: with the scheduler switching between the task and the log writer, by the time the log writer gets the CPU core again it has likely lost data and instructions it needs in the L1/L2 caches due to cache pollution.
Which NUMA Node Does The Log Writer Run On ?
I have mentioned that the log writer usually runs on core 0, but which CPU socket does this belong to ? The following query can help shed some light on which NUMA node it runs on:
```sql
SELECT s.parent_node_id
      ,s.cpu_id
FROM   sys.dm_os_tasks t
JOIN   sys.dm_os_schedulers s ON s.scheduler_id = t.scheduler_id
JOIN   sys.sysprocesses p     ON p.spid         = t.session_id
WHERE  p.cmd = 'LOG WRITER'
```
This image illustrates what the query returns:
With hyper-threading enabled, as is the case on my machine, even-numbered logical processors represent the first hyper-thread on a physical core and odd-numbered logical processors the second. I have said that you need to remove the physical core the log writer is running on, so you would be entitled to ask whether this is correct given that logical processor cpu_id 1 runs on the first physical core and cpu_id 2 runs on the second. Let's test this: we will remove NUMA node 1 completely from the CPU affinity mask and run two separate insert tests, both with 16 threads. The first test uses this affinity mask:
- All logical processors on NUMA node 0 are selected except for the first two and the last two.
- All logical processors on NUMA node 0 are selected except for the first four:
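For the record, both test masks can be set without the GUI. A sketch, again assuming 40 logical processors numbered 0-19 on NUMA node 0 and 20-39 on NUMA node 1; note that each mask leaves exactly 16 logical processors, matching the 16 test threads:

```sql
-- Test 1: NUMA node 0 minus its first and last physical cores
-- (logical CPUs 0,1 and 18,19 excluded; node 1 excluded entirely)
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 2 TO 17;

-- Test 2: NUMA node 0 minus its first two physical cores
-- (logical CPUs 0 to 3 excluded; node 1 excluded entirely)
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 4 TO 19;
```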
Based on these results, produced via a test harness I have written, removing the first CPU core from the affinity mask for the NUMA node the log writer is running on is the correct way to modify the CPU affinity mask:
Can We Influence Which NUMA Node The Log Writer Is Assigned To ?
The answer would appear to be ‘No’:
Why ‘Affinitize’ Two Instances To Their Own NUMA Nodes ?
There are three very good reasons for wanting to do this:
- The worst-case scenario for the transfer of a spinlock cache entry is the latency of the level 3 cache, not the QuickPath Interconnect between CPU sockets.
- One instance per socket for a two CPU socket server doubles the number of LOGCACHE_ACCESS and XDESMGR spinlocks available.
- If each instance's memory can be allocated 100% from the bank of memory associated with its CPU socket, this reduces foreign NUMA node memory access, which is more expensive than local node memory access.
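With the ALTER SERVER CONFIGURATION syntax this split is a one-liner per instance; a sketch for a two-socket box:

```sql
-- Run on the first instance: bind it to NUMA node 0
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0;

-- Run on the second instance: bind it to NUMA node 1
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 1;
```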
Trace Flags 8048 and 8015
Trace flag 8048 is documented in this CSS SQL Engineers blog post with the headline:
SQL Server 2008/2008 R2 on Newer Machines with More Than 8 CPUs Presented per NUMA Node May Need Trace Flag 8048
The post states that the trace flag is to be used when waits and latching activity on CMEMTHREAD and SOS_SUSPEND_QUEUE are present. Despite this not being the case in any of the tests performed, I would still like to see what effect using this trace flag has.
Trace flag 8015 will also be tested, this causes the database engine to ignore hardware NUMA ( . . . and will make @sqL_handLe happy ! 😉 ). Using the CPU affinity mask with the log writer core removed as our new baseline, here is what these two trace flags do for our insert throughput:
Much to my surprise, trace flag 8015 results in the best throughput: over 500,000 inserts / second with 36 threads.
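Both flags were enabled as startup parameters (-T8048 and -T8015); whether they actually took effect on a given run can be confirmed with:

```sql
-- Report the global / session status of trace flags 8048 and 8015
DBCC TRACESTATUS(8048, 8015);
```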
Cooling Things Down
- Options for reducing pressure on LOGCACHE_ACCESS
– Reduce the amount of logging
– Remove the CPU core the log writer is running on from the CPU affinity mask.
– Go for Xeon processors with the fastest clock speed available
– ‘Affinitize’ the insert threads to the same CPU socket that the log writer is running on, more on this later.
– Use the in memory OLTP engine, the following comes from the Microsoft research paper: Hekaton: SQL Server’s Memory-Optimized OLTP Engine
Since the tail of the transaction log is typically a bottleneck, reducing the number of log records appended to the log can improve scalability and significantly increase efficiency. Furthermore the content of the log for each transaction requires less space than systems that generate one log record per operation. Generating a log record only at transaction commit time is possible because Hekaton does not use write-ahead logging (WAL) to force log changes to durable storage before dirty data. Dirty data is never written to durable storage. Furthermore, Hekaton tries to group multiple log records into one large I/O; this is the basis for group commit and also a significant source of efficiency for Hekaton commit processing.
- Options for reducing pressure on XDESMGR
This serializes access to the part of the database engine which serves out transaction ids; the only option for reducing pressure on this spin-lock is to increase the number of DML statements and / or affected rows per transaction.
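The @BatchSize parameter in the earlier stored procedure hints at the obvious mitigation. A sketch of a batched variant that acquires one transaction id per @BatchSize rows rather than one per row:

```sql
CREATE PROCEDURE [dbo].[usp_InsertBatched]
     @BatchSize    int = 100
    ,@RowsToInsert int = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @i    INTEGER = 0
           ,@j    INTEGER = 0
           ,@base BIGINT  = @@SPID * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        -- one transaction (and hence one transaction id
        -- from XDESMGR) per @BatchSize single-row inserts
        BEGIN TRANSACTION;
            SET @j = 0;
            WHILE @j < @BatchSize AND @i < @RowsToInsert
            BEGIN
                INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i);
                SET @i += 1;
                SET @j += 1;
            END;
        COMMIT;
    END;
END;
```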
How Can The In Memory OLTP Engine Help ?
The Overhead of SQL Language Processing
The test using 34 threads and the log writer CPU core removed from the affinity mask results in the best throughput (trace flags 8048 and 8015 not used) and a language processing overhead of 12.32% of the total CPU time sampled. One would hope that natively compiled procedures might help here:
The in-memory OLTP engine is latch and lock free, but not spinlock free: 14.46% of the total sampled CPU time is expended on one spin-lock, but which one ?
A technique known as inverting the call stack allows all the callers of the function in the call stack to be easily located:
From its association with the GetXdesId function, this spin-lock is the XDESMGR spin-lock. Would lowering the overhead of SQL language processing place more pressure on the XDESMGR spin-lock, such that the net performance gain from using the in-memory OLTP engine is negligible ?
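For completeness, a sketch of the kind of natively compiled setup behind these numbers, assuming a memory-optimized counterpart of MyBigTable; the table name, procedure name and bucket count are illustrative. Each invocation of the procedure is its own atomic block, i.e. one transaction id per call, which is consistent with the XDESMGR pressure seen above.

```sql
-- Memory-optimized counterpart of MyBigTable (names illustrative)
CREATE TABLE dbo.MyBigTableIM (
    c1 bigint NOT NULL
       PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000000),
    c2 datetime  NULL,
    c3 char(111) NULL,
    c4 int       NULL,
    c5 int       NULL,
    c6 bigint    NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO

-- Natively compiled single-row insert; one atomic block
-- (one transaction id) per call
CREATE PROCEDURE dbo.usp_InsertIM
    @c1 bigint
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
                   LANGUAGE = N'us_english')
    INSERT INTO dbo.MyBigTableIM (c1) VALUES (@c1);
END;
GO
```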
A deep understanding of how the database engine interacts with the CPU cache has helped come up with a means of reducing pressure on the LOGCACHE_ACCESS spinlock and increase singleton insert performance. However, this comes at the expense of pushing the bottleneck onto the XDESMGR spinlock. In future blog posts I will cover:
- Reducing pressure on the XDESMGR spin-lock.
- How the distance spin-lock cache entries have to travel between CPU cores and sockets affects performance.
- Spin-lock pressure points when using the in memory OLTP engine compared to the conventional database engine.