The LOGCACHE_ACCESS spinlock is the final frontier when it comes to tuning SQL Server for OLTP workloads. This blog post will investigate how the NUMA node a workload runs on can have a massive impact on this one spinlock.
Let me first go over some key concepts covered in this blog post.

The LOGCACHE_ACCESS Spinlock
All worker threads need to be able to copy logging data into the log buffer; access to this is controlled by the LOGCACHE_ACCESS spinlock. In tests performed by Microsoft, at scale and under load, spins on this one spinlock can consume up to 20% of a server's total CPU capacity. This diagram shows how the LOGCACHE_ACCESS spinlock relates to the log buffer:
Non-Uniform Memory Access (NUMA)

NUMA is a type of memory architecture whereby each processor has its own local bank of memory. The driver for this was the limited scalability of multiprocessor servers that accessed a single consolidated pool of memory via the same memory bus:
A server with a NUMA architecture requires at least two populated CPU sockets; more advanced topologies become available when more than two processors are used:
Cache Lines and Cache Entries
A cache line is the unit of transfer for data between main memory and a CPU. For x64 Intel processors (the Nehalem micro-architecture onward), a cache line is always 64 bytes in size. When a cache line is loaded into the CPU cache, it is tagged with the requested memory location and becomes a cache entry.
The QuickPath Interconnect
The QuickPath Interconnect (QPI) is the fabric that glues modern Intel processors together. So that a coherent view of CPU-cached data is presented to all processors, the concept of "cache coherency" exists. A fundamental part of the cache coherency mechanism is the ability of cache entries to travel between CPU sockets, or NUMA nodes, via the QPI using a cache coherency protocol; for processors from the Nehalem micro-architecture onward this is MESIF (Modified, Exclusive, Shared, Invalid and Forward).
When a worker thread needs to acquire the LOGCACHE_ACCESS spinlock, the cache entry associated with it is released by the log writer thread; the worker thread then writes to it before it is returned to the log writer. My hypothesis is that, under load, the overhead of this cache entry having to travel over the QPI is noticeable compared to the overhead of the LOGCACHE_ACCESS cache entry travelling between CPU cores on the same socket.
Test hardware

- Dell T5500 Precision workstation with 2 x 6-core Westmere processors clocked at 2.4GHz, with both hyper-threading and turbo boost enabled.
- 32GB of 1600MHz triple-channel memory.
- 1 x 2.4TB Fusion-io ioDrive 2 Duo card.
- 2 x SanDisk Extreme PRO 480GB solid state drives.
- Windows Server 2012 R2 Enterprise Edition.
- SQL Server 2014 Enterprise Edition with CU 5 applied.
SQL Server configuration
- Trace flags 1117, 1118, 8008 and 2330 specified as start-up parameters.
- Lock pages in memory privilege granted to the database engine service account.
- Min and max memory set to 28GB.
- AdventureWorksDW 2014 used as the seed database.
Test data and environment creation:
CREATE TABLE [dbo].[MyBigTable] (
     [c1] [bigint] NOT NULL
    ,[c2] [datetime] NULL    CONSTRAINT [DF_BigTable_c2] DEFAULT (getdate())
    ,[c3] [char](111) NULL   CONSTRAINT [DF_BigTable_c3] DEFAULT ('a')
    ,[c4] [int] NULL         CONSTRAINT [DF_BigTable_c4] DEFAULT ((1))
    ,[c5] [int] NULL         DEFAULT ((2))
    ,[c6] [bigint] NULL      CONSTRAINT [DF_BigTable_c6] DEFAULT ((42))
    ,CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ( [c1] ASC )
        WITH ( PAD_INDEX              = OFF
              ,STATISTICS_NORECOMPUTE = OFF
              ,IGNORE_DUP_KEY         = OFF
              ,ALLOW_ROW_LOCKS        = ON
              ,ALLOW_PAGE_LOCKS       = OFF ) ON [FG_DEST]
) ON [FG_DEST]
GO
;WITH t4   AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
,t256 AS (SELECT 0 AS n
          FROM t4 AS a
          CROSS JOIN t4 AS b
          CROSS JOIN t4 AS c
          CROSS JOIN t4 AS d)
,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num
          FROM t256 AS a
          CROSS JOIN t256 AS b
          CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH (TABLOCKX)
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM t16M;

BACKUP LOG [AdventureWorksDW2014] TO DISK = 'NUL:'
CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC SQLPERF('sys.dm_os_latch_stats'   , CLEAR)
DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR)
DBCC SQLPERF('sys.dm_os_wait_stats'    , CLEAR)
ALTER DATABASE AdventureWorksDW2014 SET DELAYED_DURABILITY = FORCED;
CREATE PROCEDURE [dbo].[usp_Insert]
    @RowsToInsert int = 10000000
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @i    INTEGER = 0
           ,@k    BIGINT  = 0
           ,@base BIGINT  = @@SPID * 10000000000;
    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1)
            VALUES  (@base + @k)
                   ,(@base + @k + 1);
            SET @k += 2;
        COMMIT;
        SET @i += 2;
    END;
END;
Let's perform our first test. The CPU affinity mask will be altered to remove NUMA node 1 from the mask, along with the first two logical CPUs of NUMA node 0.

Why Use 10 Logical Processors and Not The Full 12 Per NUMA Node?
On the test hardware, 'CPU0' and 'CPU1' on NUMA node 0 are hyper-threads running on the same physical CPU core (core 0); this is the CPU core the log writer runs on. Any test running on NUMA node 1 will also use core 0 on NUMA node 0 because it needs to use the log writer. To create a level playing field, so that the test running on NUMA node 1 is not at a disadvantage, each test will only use 10 logical processors on each NUMA node.
Results for The 1st Test
This test completes in 55 seconds; here are the top 5 most spin-intensive spinlocks from this test:

Now let's repeat the test, but with NUMA node 0 removed from the CPU mask along with the first two logical processors of NUMA node 1.

Results for The 2nd Test

This test completes in 1 minute and 25 seconds; the top 5 most spin-intensive spinlocks from this run of the test are:

Let's take a look at the LOGCACHE_ACCESS spinlock spins in the form of a simple graph:
This graph shows that there is a 4.5-fold increase in LOGCACHE_ACCESS spins when the worker threads are not collocated on the same CPU socket as the log writer thread.
The tests have demonstrated that the overhead of the worker threads not being collocated on the same CPU socket as the log writer is tangible, both in terms of elapsed time and LOGCACHE_ACCESS spinlock spins.
If we had a way of executing an OLTP workload more efficiently, would this increase the pressure on the LOGCACHE_ACCESS spinlock? With SQL Server 2014 we do . . .
This is the T-SQL used to create the test environment for Hekaton:
ALTER DATABASE [AdventureWorksDW2014]
    ADD FILEGROUP [imoltp_mod] CONTAINS MEMORY_OPTIMIZED_DATA
ALTER DATABASE [AdventureWorksDW2014]
    ADD FILE ( name = 'imoltp_mod1', filename = N'e:\sqldata\imoltp_mod1' )
    TO FILEGROUP imoltp_mod
GO
CREATE TABLE [dbo].[MyBigTableIm] (
     [c1] [bigint] NOT NULL
    ,[c2] [datetime] NULL  CONSTRAINT [DF_BigTableIm_c2] DEFAULT (getdate())
    ,[c3] [char](111) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
                           CONSTRAINT [DF_BigTableIm_c3] DEFAULT ('a')
    ,[c4] [int] NULL       CONSTRAINT [DF_BigTableIm_c4] DEFAULT ((1))
    ,[c5] [int] NULL       CONSTRAINT [DF_BigTableIm_c5] DEFAULT ((2))
    ,[c6] [bigint] NULL    CONSTRAINT [DF_BigTableIm_c6] DEFAULT ((42))
    ,PRIMARY KEY NONCLUSTERED HASH ( [c1] ) WITH ( BUCKET_COUNT = 2097152 )
) WITH ( MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA )
GO
CREATE PROCEDURE [dbo].[usp_InsertIm]
     @spid         bigint
    ,@RowsToInsert int = 2500000
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (
     TRANSACTION ISOLATION LEVEL = SNAPSHOT
    ,LANGUAGE                    = N'us_english' )
    DECLARE @i    INTEGER = 0
           ,@base BIGINT  = @spid * 10000000000;
    WHILE @i < @RowsToInsert
    BEGIN
        INSERT INTO dbo.MyBigTableIm (c1) VALUES (@base + @i);
        SET @i += 1;
    END;
END;
GO
CREATE RESOURCE POOL [im_pool] WITH ( max_memory_percent = 50 )
GO
ALTER RESOURCE GOVERNOR RECONFIGURE
GO
EXEC sp_xtp_bind_db_resource_pool 'AdventureWorksDW2014', 'im_pool'
Because Hekaton requires twice the amount of memory a memory-optimized table takes up to be available in the resource pool assigned to the database containing it, I have had to go for 2,500,000 inserts with this test instead of the 10,000,000 used before.
The T-SQL used to seed the table is identical to that used before, except that the table being inserted into is MyBigTableIm, and Hekaton does not support TABLOCKX hints, so the hint had to be removed. On the subject of things Hekaton does not support, whilst creating this second test I noted that it does not like the following:
- The @@SPID system function
- INSERT statements using row value constructors
- Conventional BEGIN TRANSACTION COMMIT blocks in stored procedures
- Clustered indexes
- TABLOCKX hints ( already mentioned, but added for completeness )
- TRUNCATE TABLE against memory-optimized tables; at present rows can only be deleted.
Let's repeat the tests, first with 10 logical processors available on NUMA node 0. This completes in 10.434 seconds; the top 5 most spin-intensive spinlocks are:
The exact same test repeated using 10 logical processors on NUMA node 1 completes in 18.388 seconds, the top 5 most spin intensive spinlocks are:
Using the technique I covered in Diagnosing Spinlock Problems By Doing The Math, we can determine the ratio of CPU cycles expended to LOGCACHE_ACCESS spins. With hyper-threading enabled, 10 threads will use 5 physical CPU cores, and each 2.4GHz core executes 2,400,000,000 cycles per second:

10.434 (elapsed time for test) x 5 (CPU cores) x 2,400,000,000 (CPU cycles per second) = 125,208,000,000 CPU cycles

The first test incurred 113,246,628 spins on LOGCACHE_ACCESS, which equates to 1,106 CPU cycles per LOGCACHE_ACCESS spin (to the nearest whole cycle). Now the same for the second test:

18.388 (elapsed time for test) x 5 (CPU cores) x 2,400,000,000 (CPU cycles per second) = 220,656,000,000 CPU cycles

sys.dm_os_spinlock_stats tells us that during the second test there were 6,692,284,057 spins on the LOGCACHE_ACCESS spinlock, which equates to 33 CPU cycles per spin.
Because 33 CPU cycles per spin seems like a conservative figure, let's see what the call stack can tell us, first for when the worker threads are running on NUMA node 0:
and for when the worker threads are running on NUMA node 1:
The LOGCACHE_ACCESS spin sampled CPU time when the worker threads and log writer thread are collocated is 215.31, versus 374.22 when they are not.
Disappointingly, the total sampled CPU time for the Hekaton test with worker thread and log writer collocation is 131,969.88, compared to 146,277.34 when the worker threads and log writer are on different CPU sockets; the spin CPU time is a drop in the ocean in comparison.
Over my last few blog posts on spinlocks I have come to suspect that the minimum hardware requirement for creating excessive spinlock activity is a server with two CPU sockets of eight cores each; a four-socket machine would be ideal.
Let's take a slight step back and compare the number of spins on LOGCACHE_ACCESS when the worker threads are and are not collocated on the same CPU socket as the log writer thread:
The Hekaton test results in a 59-fold increase in LOGCACHE_ACCESS spins when there is no worker thread and log writer thread CPU socket collocation, according to sys.dm_os_spinlock_stats; however, the call stack obtained through the Windows Performance Toolkit suggests otherwise.
When processing a heavy and sustained OLTP workload, collocating the worker threads and log writer thread on the same CPU socket can result in a dramatic decrease in LOGCACHE_ACCESS spinlock spins; this effect is more pronounced with Hekaton. There is some factor at play, which I have yet to get to the bottom of, that causes the workload to complete in a lower elapsed time when it is collocated with the log writer.
As per the comment I made earlier in this blog post, more CPU cores are required in order to create any notable pressure on the LOGCACHE_ACCESS spinlock. Pending the availability of a "big box", something along the lines of an HP DL580, it would be interesting to perform three tests in order to quantify the relationship between the number of NUMA nodes and OLTP workload scalability:
- 100 % of the workload on one CPU socket:
- 100% of the workload split between two sockets:
- 100% of the workload split between all four CPU sockets: