NUMA, The LOGCACHE_ACCESS Spinlock and The Best Machine for OLTP Workloads

The LOGCACHE_ACCESS spinlock is the final frontier when it comes to tuning SQL Server for OLTP workloads. This blog post will investigate how the NUMA node the workload runs on can have a massive impact on this one spinlock.

Let me first go over some key concepts covered in this blog post.

The LOGCACHE_ACCESS Spinlock

All worker threads need to be able to copy logging data into the log buffer; access to the buffer is serialized by the LOGCACHE_ACCESS spinlock. In tests performed by Microsoft, at scale and under load, spins on this one spinlock can consume up to 20% of a server's total CPU capacity.
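Spin activity on this spinlock can be watched through the sys.dm_os_spinlock_stats DMV; a minimal query along these lines will do:

SELECT name
      ,collisions
      ,spins
      ,spins_per_collision
      ,backoffs
FROM   sys.dm_os_spinlock_stats
WHERE  name = 'LOGCACHE_ACCESS';

This diagram shows how the LOGCACHE_ACCESS spinlock relates to the log buffer: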

[Image: the LOGCACHE_ACCESS spinlock and the log buffer]

Non-Uniform Memory Access

NUMA is a type of memory architecture whereby each processor has its own local bank of memory. The driver for this was the limited scalability of multi-processor servers that accessed a single consolidated pool of memory via the same memory bus:

[Image: basic NUMA topology]

A server with a NUMA architecture requires at least two populated CPU sockets; more advanced topologies are available when more than two processors are used:

[Image: advanced NUMA topologies]

The image above has been produced using information from Joe Chang's blog with his permission.

Cache Lines and Cache Entries

A cache line is the unit of transfer for data between main memory and a CPU; for x64 Intel processors (the Nehalem micro-architecture onward) a cache line is always 64 bytes in size. When a cache line is loaded into the CPU cache, it is tagged with the requested memory location and becomes a cache entry.

QuickPath Interconnect

The QuickPath Interconnect (QPI) is the fabric which glues modern Intel processors together. So that a coherent view of CPU-cached data is presented to all processors, a cache coherency mechanism exists. A fundamental part of this mechanism is the ability of cache entries to travel between CPU sockets, or NUMA nodes, via the QPI using a cache coherency protocol; for processors from the Nehalem micro-architecture onward this is MESIF (Modified, Exclusive, Shared, Invalid and Forwarding).

Hypothesis

When a worker thread needs to acquire the LOGCACHE_ACCESS spinlock, the cache entry associated with it is released by the log writer thread; the worker thread then writes to it before it is returned to the log writer. My hypothesis is that under load the overhead of this cache entry having to travel over the QuickPath Interconnect is noticeable when compared to the overhead of the LOGCACHE_ACCESS cache entry travelling between CPU cores on the same socket.

CPU Socket to CPU Socket Latency

[Image: socket-to-socket latency]

Core to Core Latency

[Image: core-to-core latency]

This is the test environment that will be used to investigate this:

  • Dell T5500 Precision workstation with 2 x 6-core Westmere processors clocked at 2.4GHz, with both hyper-threading and turbo boost enabled.
  • 32GB of 1600MHz triple channel memory.
  • 1 x 2.4TB Fusion-io ioDrive 2 Duo card.
  • 2 x SanDisk Extreme PRO 480GB solid state drives.
  • Windows Server 2012 R2 Enterprise Edition.
  • SQL Server 2014 Enterprise Edition with CU 5 applied.

SQL Server Configuration

  • Trace flags 1117, 1118, 8008 and 2330 specified as start-up parameters.
  • Lock pages in memory privilege granted to the database engine service account.
  • Min and max memory set to 28GB.
  • AdventureWorksDW2014 used as the seed database.
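The settings above can be sanity checked before running anything; a quick sketch:

DBCC TRACESTATUS (1117, 1118, 8008, 2330, -1);

SELECT name
      ,value_in_use
FROM   sys.configurations
WHERE  name IN ('min server memory (MB)', 'max server memory (MB)');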

Test data and environment creation:

CREATE TABLE [dbo].[MyBigTable] (
     [c1] [bigint]    NOT NULL
    ,[c2] [datetime]  NULL
    --
     CONSTRAINT [DF_BigTable_c2] DEFAULT (getdate())
    --
    ,[c3] [char](111) NULL
    --
     CONSTRAINT [DF_BigTable_c3] DEFAULT ('a')
    --
    ,[c4] [int]       NULL
    --
     CONSTRAINT [DF_BigTable_c4] DEFAULT ((1))
    --
    ,[c5] [int]      NULL DEFAULT ((2))
    ,[c6] [bigint]   NULL
    --
     CONSTRAINT [DF_BigTable_c6] DEFAULT ((42))
    --
    ,CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ([c1] ASC)
         WITH ( PAD_INDEX              = OFF
               ,STATISTICS_NORECOMPUTE = OFF
               ,IGNORE_DUP_KEY         = OFF
               ,ALLOW_ROW_LOCKS        = ON
               ,ALLOW_PAGE_LOCKS       = OFF) ON [FG_DEST]) ON [FG_DEST]
GO

;WITH t4   AS (SELECT n
               FROM (VALUES(0),(0),(0),(0)) t(n))
     ,t256 AS (SELECT     0  AS n
               FROM       t4 AS a
               CROSS JOIN t4 AS b
               CROSS JOIN t4 AS c
               CROSS JOIN t4 AS d)
     ,t16M AS (SELECT     ROW_NUMBER() OVER (ORDER BY (a.n)) AS num
               FROM       t256 AS a
               CROSS JOIN t256 AS b
               CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH(TABLOCKX)
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM   t16M;

BACKUP LOG [AdventureWorksDW2014] TO DISK='NUL:'
CHECKPOINT
DBCC DROPCLEANBUFFERS

DBCC SQLPERF('sys.dm_os_latch_stats'   , CLEAR)
DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR)
DBCC SQLPERF('sys.dm_os_wait_stats'    , CLEAR)

ALTER DATABASE AdventureWorksDW2014 
SET DELAYED_DURABILITY = FORCED;
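
With the spinlock stats cleared, the "top 5 most spin intensive spinlocks" figures shown later in this post can be reproduced after each run with a query along these lines (my assumption; the screenshots themselves may have come from elsewhere):

SELECT TOP (5)
        name
       ,collisions
       ,spins
       ,spins_per_collision
       ,backoffs
FROM    sys.dm_os_spinlock_stats
ORDER BY spins DESC;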

I’m going to use ostress to execute 10 concurrent threads:

[Image: ostress command line]
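The exact command line is in the screenshot above; a reconstruction along these lines should be close (the server name is a placeholder, and the switches assume the RML Utilities version of ostress):

ostress -E -SMyServer -dAdventureWorksDW2014 -Q"EXEC dbo.usp_Insert" -n10 -q

-n10 runs the statement on 10 concurrent connections and -q discards the result sets.

This is the code for the usp_Insert stored procedure: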

CREATE PROCEDURE [dbo].[usp_Insert]
     @RowsToInsert int = 10000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE  @i INTEGER   = 0
            ,@k BIGINT    = 0
            ,@base BIGINT = @@SPID * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1)
            VALUES  (@base + @k)
                   ,(@base + @k + 1);
            SET @k += 2;
        COMMIT;
        SET @i += 2;
    END;
END;
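
For a quick single-session sanity check before unleashing ostress, the procedure can be run on its own; the row count below is a deliberately small, hypothetical value:

EXEC dbo.usp_Insert @RowsToInsert = 100000;

Note that each session derives its key range from @@SPID, so concurrent executions do not collide on the primary key.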

Let's perform our first test: the CPU affinity mask will be altered to remove NUMA node 1 from the mask, along with the first 2 logical CPUs from NUMA node 0:

[Image: CPU affinity mask for the first test, NUMA node 0]

Why Use 10 Logical Processors and Not The Full 12 Per NUMA Node?

On the test hardware, 'CPU0' and 'CPU1' on NUMA node 0 are hyper-threads running on the same physical CPU core (core 0); this is the CPU core the log writer runs on. Any test running on NUMA node 1 will also use core 0 on NUMA node 0, because it needs the log writer. To create an even playing field, so that the test running on NUMA node 0 is not at a disadvantage, each test will only use 10 logical processors on its NUMA node.
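
The affinity changes in the screenshots can also be made with T-SQL; a sketch, assuming logical processors 0-11 belong to NUMA node 0 and 12-23 to NUMA node 1:

-- first test: NUMA node 0 minus the two logical CPUs that share core 0
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 2 TO 11;

-- second test: NUMA node 1 minus its first two logical CPUs
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 14 TO 23;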

Results for The 1st Test

This test completes in 55 seconds; here are the top 5 most spin intensive spinlocks from this test:

[Image: top 5 spinlocks by spins, first test]

Now let's repeat the test, but with NUMA node 0 removed from the CPU mask along with the first two logical processors from NUMA node 1:

[Image: CPU affinity mask for the second test, NUMA node 1]

Results for The 2nd Test

This test completes in 1 minute and 25 seconds; the top 5 most spin intensive spinlocks from this run of the test are:

[Image: top 5 spinlocks by spins, second test]

Let's take a look at the LOGCACHE_ACCESS spinlock spins in the form of a simple graph:

[Image: graph of LOGCACHE_ACCESS spins, test 1 versus test 2]

This graph shows that there is a 4.5-fold increase in LOGCACHE_ACCESS spins when the worker threads are not collocated on the same CPU socket as the log writer thread.

Hypothesis Proved

The tests have proved that the overhead of the worker threads not being collocated on the same CPU socket as the log writer is tangible, both in terms of elapsed time and LOGCACHE_ACCESS spinlock spins.

If we had a way of executing an OLTP workload more efficiently, would this increase the pressure on the LOGCACHE_ACCESS spinlock? With SQL Server 2014 we do . . .

Hekaton

This is the T-SQL used to create the test environment for Hekaton:

ALTER DATABASE [AdventureWorksDW2014]
    ADD FILEGROUP [imoltp_mod] CONTAINS MEMORY_OPTIMIZED_DATA
ALTER DATABASE [AdventureWorksDW2014]
    ADD FILE (name='imoltp_mod1', filename=N'e:\sqldata\imoltp_mod1')
TO FILEGROUP imoltp_mod
GO

CREATE TABLE [dbo].[MyBigTableIm] (
     [c1] [bigint]   NOT NULL
    --
    ,[c2] [datetime] NULL
    --
     CONSTRAINT [DF_BigTableIm_c2] DEFAULT (getdate())
    --
    ,[c3] [char](111) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    --
     CONSTRAINT [DF_BigTableIm_c3] DEFAULT ('a')
    --
    ,[c4] [int] NULL
    --
     CONSTRAINT [DF_BigTableIm_c4] DEFAULT ((1))
    --
    ,[c5] [int] NULL
    --
     CONSTRAINT [DF_BigTableIm_c5] DEFAULT ((2))
    --
    ,[c6] [bigint] NULL
    --
     CONSTRAINT [DF_BigTableIm_c6] DEFAULT ((42))
    ,PRIMARY KEY NONCLUSTERED HASH (
         [c1]
     ) WITH ( BUCKET_COUNT = 2097152)
) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )
GO
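
One thing worth noting: BUCKET_COUNT is fixed at 2,097,152, and if each of the ten sessions inserts its default 2,500,000 rows, roughly 25 million rows land in the table, so the hash chains will grow well beyond the ideal of one row per bucket. Bucket usage can be checked with the hash index stats DMV:

SELECT i.name
      ,hs.total_bucket_count
      ,hs.empty_bucket_count
      ,hs.avg_chain_length
      ,hs.max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats AS hs
JOIN   sys.indexes                    AS i
ON     i.object_id = hs.object_id
AND    i.index_id  = hs.index_id
WHERE  hs.object_id = OBJECT_ID('dbo.MyBigTableIm');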

CREATE PROCEDURE [dbo].[usp_InsertIm]
     @spid         bigint
    ,@RowsToInsert int    = 2500000
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC
    WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
          ,LANGUAGE = N'us_english')
  
    DECLARE  @i    INTEGER = 0
            ,@base BIGINT  = @spid * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN
            INSERT INTO dbo.MyBigTableIm (c1) VALUES (@base + @i);
        END;
        SET @i += 1;
    END;
END;
GO
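
Because @@SPID cannot be used inside a natively compiled procedure (more on this shortly), the session id has to be passed in from interpreted T-SQL, along these lines:

EXEC dbo.usp_InsertIm @spid = @@SPID, @RowsToInsert = 2500000;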

CREATE RESOURCE POOL [im_pool] WITH (
    max_memory_percent = 50 )
GO

ALTER RESOURCE GOVERNOR RECONFIGURE
GO

EXEC sp_xtp_bind_db_resource_pool 'AdventureWorksDW2014', 'im_pool'
GO

-- note: the binding only takes effect once the database has been taken
-- offline and brought back online

Because Hekaton requires twice the amount of memory a memory optimized table takes up to be available in the resource pool assigned to the database containing it, I have had to go for 2,500,000 inserts with this test and not the 10,000,000 used before.

The T-SQL used to seed the table is identical to that used before, except that the table being inserted into is MyBigTableIm and, because Hekaton does not support TABLOCKX hints, the hint had to be removed. On the subject of things Hekaton does not support, whilst creating this second test I noted that it does not like the following:

  • The @@SPID system function
  • INSERT statements using row value constructors
  • Conventional BEGIN TRANSACTION COMMIT blocks in stored procedures
  • Clustered indexes
  • TABLOCKX hints ( already mentioned, but added for completeness )
  • Memory optimized tables cannot be TRUNCATED at present, only deleted.

The ostress command line needs to be modified slightly also, to:

[Image: modified ostress command line]

Let's repeat the tests, first with 10 logical processors available on NUMA node 0; this completes in 10.434 seconds, and the top 5 most spin intensive spinlocks are:

[Image: top 5 spinlocks by spins, Hekaton test on NUMA node 0]

The exact same test repeated using 10 logical processors on NUMA node 1 completes in 18.388 seconds; the top 5 most spin intensive spinlocks are:

[Image: top 5 spinlocks by spins, Hekaton test on NUMA node 1]

Using the technique I covered in Diagnosing Spinlock Problems By Doing The Math, we can determine the number of CPU cycles expended per LOGCACHE_ACCESS spin. With hyper-threading enabled, 10 threads will use 5 physical CPU cores:

   10.434 (elapsed time for test) x 5 (CPU cores) x 2,400,000,000 (CPU cycles per second at 2.4GHz) = 125,208,000,000 CPU cycles

The first test incurred 113,246,628 spins on LOGCACHE_ACCESS, which works out at 1,106 CPU cycles per LOGCACHE_ACCESS spin (to the nearest whole cycle). Now the same for the second test:

   18.388 (elapsed time for test) x 5 (CPU cores) x 2,400,000,000 (CPU cycles per second at 2.4GHz) = 220,656,000,000 CPU cycles

sys.dm_os_spinlock_stats tells us that during the second test there were 6,692,284,057 spins on the LOGCACHE_ACCESS spinlock, which equates to 33 CPU cycles per spin.
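
For repeatability, the same arithmetic can be scripted; a minimal sketch using the figures above:

DECLARE @hz bigint = 2400000000;  -- 2.4GHz nominal clock from the hardware spec

SELECT  test
       ,total_cycles    = CAST(elapsed_s * 5 * @hz AS bigint)          -- 5 physical cores in use
       ,cycles_per_spin = CAST(elapsed_s * 5 * @hz / spins AS bigint)  -- integer truncation
FROM   (VALUES ('NUMA node 0', CAST(10.434 AS decimal(9,3)), CAST(113246628  AS bigint))
              ,('NUMA node 1', CAST(18.388 AS decimal(9,3)), CAST(6692284057 AS bigint))
       ) AS t (test, elapsed_s, spins);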

Because 33 CPU cycles per spin seems like a conservative figure, let's see what the call stack can tell us, first for when the worker threads are running on NUMA node 0:

[Image: call stack and sampled CPU, worker threads on NUMA node 0]

and for when the worker threads are running on NUMA node 1:

[Image: call stack and sampled CPU, worker threads on NUMA node 1]

The sampled CPU for LOGCACHE_ACCESS spinning is 215.31 when the worker threads and log writer thread are collocated, versus 374.22 when they are not.

Disappointingly, the total sampled CPU time for the Hekaton test with worker thread / log writer collocation is 131,969.88, compared to 146,277.34 when the worker threads and log writer are on different CPU sockets; the spin CPU time is a drop in the ocean in comparison.

Over my last few blog posts on spinlocks I have come to suspect that the minimum hardware requirement for creating excessive spinlock activity is a server with two CPU sockets and eight cores in each; a four CPU socket machine would be ideal.

Let's take a slight step back and compare the number of spins on LOGCACHE_ACCESS when the worker threads are and are not collocated on the same CPU socket as the log writer thread:

[Image: graph of LOGCACHE_ACCESS spins, Hekaton tests]

According to sys.dm_os_spinlock_stats, the Hekaton test results in a 59-fold increase in LOGCACHE_ACCESS spins when there is no worker thread / log writer thread CPU socket collocation; however, the call stacks obtained through the Windows Performance Toolkit suggest the CPU cost of those spins is far less dramatic.

Conclusion

When processing a heavy and sustained OLTP workload, collocating the worker threads and log writer thread on the same CPU socket can result in a dramatic decrease in LOGCACHE_ACCESS spinlock spins, and this effect is more pronounced with Hekaton. There is some factor at play, which I have yet to get to the bottom of, causing the workload to complete in a lower elapsed time when it is collocated with the log writer.

As per the comment I made earlier in this blog post, more CPU cores are required in order to create any notable pressure on the LOGCACHE_ACCESS spinlock. Pending the availability of a "big box", something along the lines of an HP DL580, it would be interesting to perform three tests in order to quantify the relationship between the number of NUMA nodes and OLTP workload scalability:

  • 100% of the workload on one CPU socket:

    [Image: test 1 topology]

  • 100% of the workload split between two CPU sockets:

    [Image: test 2 topology]

  • 100% of the workload split between all four CPU sockets:

    [Image: test 3 topology]

I am currently in the process of procuring test hardware to make these tests possible, so watch this space for further developments on this.
