Tuning The LOGCACHE_ACCESS Spinlock On A “Big box”

To date I have theorized about the behavior of certain types of spinlocks, now I have access to hardware which can put these under serious pressure I can quantify much of what I have blogged about.

“A Tale Of Two Spinlocks”

The focus of this blog post will be on two spinlocks:

  • LOGCACHE_ACCESS
    A task’s thread has to acquire this before it can copy logging data from its own private memory area into the log buffer:logging
  • XDESMGR
    This serializes access to the part of the database engine that serves out transaction ids.

What Do You Class As A “Big Box” ?

I class a big box as one with enough physical cores to generate pressure on unique points of serialization within the database engine. There is one particular unique point of serialization I’m thinking about; the LOGCACHE_ACCESS spinlock. Earlier tests suggest that a server with two CPU sockets with more than six cores per socket is required to stress this spin-lock.

My “Big Box”

  • ASUS Z10-D16 WS motherboard
  • 2 x Xeon E5 2650 (10 core, 2.3Ghz) with hyper-threading enabled
  • 64GB quad channel DDR4
  • Windows server 2012 R2
  • SQL Server 2016 CTP 2.2
  • Baseline configuration trace flags: 1117, 1117, 2032 and 8008.
  • SQL Server max and min memory set to 57344MB
  • Lock pages in memory system privilege granted to database engine service account
  • Storage: RAID 0 64K allocation size volume composed of 3 x SanDisk Extreme Pro solid state drives (480GB) and 1 x FusionIO (Gen 1) 320GB ioDrive.
  • Delayed Durability set to FORCED on the test database

From an earlier blog post I found that a clustered index key crafted around @@SPID is highly scale-able. Below is a table with a clustered index, a SQL statement to ‘Seed’ it with data and a stored procedure to perform inserts when invoked by ostress from rml utilities:


CREATE TABLE [dbo].[MyBigTable](
 [c1] [bigint]    NOT NULL,
 [c2] [datetime]  NULL,
 [c3] [char](111) NULL,
 [c4] [int]       NULL,
 [c5] [int]       NULL,
 [c6] [bigint]    NULL,
 CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED (
     [c1] ASC
 ) WITH ( PAD_INDEX              = OFF
         ,STATISTICS_NORECOMPUTE = OFF
         ,IGNORE_DUP_KEY         = OFF
         ,ALLOW_ROW_LOCKS        = ON
         ,ALLOW_PAGE_LOCKS       = ON) ON [FG_01]
) ON [FG_01]
GO

TRUNCATE TABLE dbo.MyBigTable;
;WITH  t4   AS (SELECT n
                FROM   (VALUES(0),(0),(0),(0)) t(n))
      ,t256 AS (SELECT     0  AS n
                FROM       t4 AS a
                CROSS JOIN t4 AS b
                CROSS JOIN t4 AS c
                CROSS JOIN t4 AS d)
      ,t16M AS (SELECT     ROW_NUMBER()
                              OVER (ORDER BY (a.n)) AS num
                FROM       t256 AS a
                CROSS JOIN t256 AS b
                CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH TABLOCKX
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM   t16M;
BACKUP LOG TO DISK = 'NUL:'
DBCC SQLPERF("sys.dm_os_wait_stats"    , clear)
DBCC SQLPERF("sys.dm_os_spinlock_stats", clear)
DBCC SQLPERF("sys.dm_os_latch_stats"   , clear)

CREATE PROCEDURE [dbo].[usp_Insert] 
     @BatchSize    int = 1
    ,@RowsToInsert int = 2000000 
AS BEGIN 
    SET NOCOUNT ON;
   
    DECLARE  @i    INTEGER = 0 
            ,@base BIGINT  = @@SPID * 10000000000; 

    WHILE @i < @RowsToInsert
    BEGIN 
        BEGIN TRANSACTION 
            INSERT INTO dbo.MyBigTable (c1) 
                VALUES (@base + @i); 
        COMMIT; 
        SET @i += 1;
    END; 
END; 

This performs 186,915 inserts peer second for 40 threads with the following wait and spin-lock statistics:

base 40 waits

baseline spins

Can We Do Anything To Reduce LOGCACHE_ACCESS Pressure ?

Lets re-run the test with the first core of NUMA node 1 removed from the CPU affinity mask:

aff mask

This run produces 436,781 inserts per second for 38 threads with the following wait and spinlock activity:

waits

aff mask spins

The transaction log IO throughput is greater when the log writer CPU core has been removed from the CPU affinity mask. This is the log throughput for the default CPU affinity mask:

baseline log throughput

and this is the throughput with the log writer CPU core removed from the affinity mask:

aff mask log throughput

The implication is that the excessive spin activity on the LOGCACHE_ACCESS spinlock is throttling log writer flush IO throughput when the default CPU affinity mask is in use.

Where Is Our CPU Time Going ?

core0

I’ve highlighted LogAppendContext::BackOff, with the default CPU affinity mask this accounts for 23.89% of the total sampled CPU time, after removing core 0 from the CPU affinity mask this plummets to 1.43% !!!!

nocore0

Based on its name alone one would hope that LogAppendContext::BackOff has something to do with logging related spin-lock activity, what does the database engine call stack have to say about this ?:

core0_stack

We can see the memcpy function used to copy a sessions logging information into the log buffer and a call to sqlmin.dll!SQLServerLogMgr above it preceding our call to sqlmin.dll!LogAppendContext::BackOff. This suggests we are in the right area for log buffer access serialization.

Having established that removal the core the log writer is running on from the CPU affinity mask has a major impact on the insert rate, what do the various spin-lock and insert rate metrics look like ?. In the graphs below ‘Baseline’ refers to the tests with the CPU affinity mask in its default state and “Optimized CPU affinity masks” refers to the CPU affinity mask without the log writer CPU core present.

  • Baseline test insert rate and LOGCACHE_ACCESS spins / thread count

baseline spins

  • Insert rate and LOGCACHE_ACCESS spins / thread count with the ‘Optimised’ CPU affinity mask

aff mask spins

  • Insert rate comparison

insert rate comparison

  • LOGCACHE_ACCESS spin comparison

lca spins comparison

  • XDESMGR spin comparison

xdesmgr spins comparison

#Takeaway 1

‘Affinitizing’ the rest of the database engine away from the log writer CPU core increases DML throughput significantly as summarized in this graph:

insert rate improvement

#Takeaway 2

Referring to the LOGCACHE_ACCESS spin comparison graph, if we take away the outliers, the spin activity for this  spin-lock is relatively static across both tests, the key thing to note is that we are achieving a superior insert rate with the modified CPU affinity mask despite the LOGCACHE_ACCESS spin rate remaining almost the same.

#Takeaway 3

Improving the efficiency with which the LOGCACHE_ACCESS spin-lock is acquired and released increases pressure on the XDESMGR spin-lock.

How Has This Worked Exactly ?

Understanding how this optimization works requires that we understand a bit about memory, the unit of transfer between main memory and the CPU is the “Cache line”. Once inside the CPU cache hierarchy cache lines become cache entries and are tagged. The LOGCACHE_ACCESS spin-lock has its own cache entry. Any session’s thread wishing to copy logging information into the log buffer must first acquire the cache entry associated with the LOGCACHE_ACCESS spinlock:

spinlocks

The speed at which the LOGCACHE_ACCESS spinlock can be acquired and released is determined by how fast it’s cache entry can be handed off to the CPU core running the task thread attempting to copy logging information into the log buffer and returned. Allowing other database engine threads to run on the same core as the log writer introduces the possibility of the log writer thread being context switched in before this “Hand off” can take place. ‘Affinitizing’ the rest of the database engine away from the log writers home CPU core expedites the hand off process by  minimizing the number of context switches that might get added to this hand off process.

The amount of spin activity a spinlock undergoes is also influenced by how far the spinlock’s cache entry has to travel:

The Good The Bad and The Ugly

The fact that a context switch is the great white shark of performance killers cannot be overstated, a good ball park figure for main memory access is 167 CPU cycles plus any CPU cycles required by virtual to physical memory translation. To put into context the cost of a context switch (pun not intended) take note of this except from osdev wiki:

The overhead of storing / restoring registers (not counting any TLB overhead / excluding cost of FPU register store / restore) is 188 cycles on the P4 (67 ns), 79 cycles on the P2 (395 ns).

A context switch also includes the overhead of switching address spaces (if we’re switching between processes, not threads). The minimal cost of switching between two address spaces (counting a minimal TLB reload of 1 code page, 1 data page, and 1 stack page) is 516 cycles on a P4 (184 ns) and 177 cycles on a P3 (885 ns).

Excluding the cost of a process or thread transitioning from user mode to kernel mode, a context switch costs upwards of twice the number of CPU cycles required by main memory access.

Edit 23/09/2016

I should add that because the log writer runs on top of a scheduler, the “Context switch” the log writer undergoes involves yielding and being re-scheduled via the run-able queue, despite being cheaper than a full blown context switch this is:

a) Not free

b) By the scheduler thread switching between the task and the log writer, when the log writer gets the CPU core again it is likely it has lost data and instructions it needs in the L1/2 cache due to cache pollution.

Which NUMA Node Does The Log Writer Run On ?

I have mentioned that the log writer usually runs on core 0, but which CPU socket does this belong to ?, this following query can help shed some light on which NUMA node it runs on:

SELECT  s.parent_node_id
       ,s.cpu_id
FROM   sys.dm_os_tasks t
JOIN   sys.dm_os_schedulers s
ON     s.scheduler_id = t.scheduler_id
JOIN   sys.sysprocesses P
ON     p.spid = t.session_id
WHERE  p.cmd  = 'LOG WRITER'

This image illustrates what the query returns:

log writer location

With hyper-threading enabled, as is the case with my machine, even-numbered logical processors represent the first hyper-thread to run on a physical core and odd-numbered logical processors represent the second hype-thread to run on a physical core, I’ve said that you need to remove the physical core that the log writer is running on, you would be entitled to ask if this information is correct given that logical processor cpu_id 1 runs on the first physical core and cpu_id 2 runs on the second. Lets test this, we will remove NUMA node 1 completely from the CPU affinity mask and run two separate insert tests both with 16 threads, the first test with this affinity mask:

  • All logical processors are selected for NUMA node 0 except for the first and last two.

aff mask 1

  • All logical processors on NUMA node 0 are selected except for the for the first four:

aff mask 2

Based on these results produced via a test harness I have written the removal of the first CPU core from the affinity mask for the NUMA node the log writer is running is the correct way to modify to the CPU affinity mask:

resultsThe query returns two rows because one row is for the log writer and the other the checkpoint process, my “Reliable source” claims that they were separate processes at one stage:

reliable source

Can We Influence Which NUMA Node The Log Writer Is Assigned To ?

The answer would appear to be ‘No’:

thomas

Why ‘Affinitize’ Two Instances To Their Own NUMA Nodes ?

There is three very good reasons for wanting to do this:

  • The worse case scenario for the transfer of spinlock cache entry is the latency of the level 3 cache and not the quick path interconnect between CPU sockets.
  • One instance per socket for a two CPU socket server doubles the number of LOGCACHE_ACCESS and XDESMGR spinlocks available.
  • If each instances memory can be allocated 100% from the bank of memory associated with each CPU socket, this reduces foreign NUMA node memory access which is more expensive than local node memory access.

Trace Flags 8048 and 8015

Trace flag 8048 is documented in this CSS SQL Engineer’s blog post with the head line:

SQL Server 2008/2008 R2 on Newer Machines with More Than 8 CPUs Presented per NUMA Node May Need Trace Flag 8048

the post states that the trace flag is to be used when waits and latching activity on CMEMTHREAD and SOS_SUSPEND_QUEUE are present, despite this not being the case with any of the tests performed, I would still like to see what the effect of using this trace flag is.

Trace flag 8015 will also be tested, this causes the database engine to ignore hardware NUMA ( . . . and will make @sqL_handLe happy ! 😉 ). Using the CPU affinity mask with the log writer core removed as our new baseline, here is what these two trace flags do for our insert throughput:

TF8048_8015

Much to my surprise trace flag 8015 results in the best throughput over 500,000 inserts / second with 36 threads.

Cooling Things Down

  • Options for reducing pressure on LOGCACHE_ACCESS
    – Reduce the amount of logging
    – Remove the CPU core the log writer is running on from the CPU affinity mask.
    – Go for Xeon processors with the fastest clock speed available
    – ‘Affinitize’ the insert threads to the same CPU socket that the log writer is running on, more on this later.
    – Use the in memory OLTP engine, the following comes from the Microsoft research paper: Hekaton: SQL Server’s Memory-Optimized OLTP Engine

Since the tail of the transaction log is typically a bottleneck, reducing the number of log records appended to the log can improve scalability and significantly increase efficiency. Furthermore the content of the log for each transaction requires less space than systems that generate one log record per operation. Generating a log record only at transaction commit time is possible because Hekaton does not use write-ahead logging (WAL) to force log changes to durable storage before dirty data. Dirty data is never written to durable storage. Furthermore, Hekaton tries to group multiple log records into one large I/O; this is the basis for group commit and also a significant source of efficiency for Hekaton commit processing.

  • Options for reducing pressure on XDESMGR
    This serializes access to the part of the database engine which serves out transaction ids, the only option for reducing pressure on this spin lock to increase the number of DML statements and / or effected rows per transaction.

How Can The In Memory OLTP Engine Help ?

The Overhead of SQL Language Processing

The test using 34 threads and the log writer CPU core removed from the affinity mask results in the best throughput (trace flags 8048 and 8115 not used) and a language processing overhead of 12.32% of the total CPU time sampled. One would hope that natively compiled procedures might help here:
sqllang weightSpin-lock Overhead

The in memory OLTP engine is latch and lock free, but not spinlock free, 14.46% of the total sampled CPU time is expended on one spin-lick, but which one ?

sqpinlock weight

A technique known as inverting the call stack allows all the callers of the function in the call stack to be easily located:

inverted stack

From its association with the GetXdesId function, this spin-lock is the XDESMGR spin-lock. Would lowering the overhead of SQL language processing result in more pressure being placed on the XDESMGR spin-lock such that the net performance gain from using in memory OLTP engine is negligible ?.

Final Words

A deep understanding of how the database engine interacts with the CPU cache has helped come up with a means of reducing pressure on the LOGCACHE_ACCESS spinlock and increase singleton insert performance. However, this comes at the expense of pushing the bottleneck onto the XDESMGR spinlock. In future blog posts I will cover:

  • Reducing pressure on the XDESMGR spin-lock.
  • How the distance spin-lock cache entries have to travel between CPU cores and sockets effects performance.
  • Spin-lock pressure points when using the in memory OLTP engine compared to the conventional database engine.

5 thoughts on “Tuning The LOGCACHE_ACCESS Spinlock On A “Big box”

  1. Wow, you have surpassed yourself here. As I mentioned to you yesterday, these blog posts are very stack-intensive – I find myself constantly having to save context and learn some more background in order to keep up!

    One thing which I didn’t use to want to believe is now clearer. While Microsoft keeps exposing more and more “magic levers” in the form of trace flags and config options, there is still a lot wizardry (and I suppose occasional gambling) involved in finding the sweet spot among the permutations.

    Kind of wondering as well if the ongoing task of pushing the limits will be driving more aggressive code optimisation within SQL Server. Premature optimisation and all that: 99% of code probably doesn’t run in a tight enough loop to warrant CPU cycle and instruction counting. But the harder you push throughput on single-CPU activity, the lower that 99% will become. And at some point somebody may even ask if some bits deserve to be written in no-holds-barred assembler. SQL Server now lives in an Intel monoculture world which has changed a lot from the NT days where cross-platform portability was The Thing.

    • Certain code paths within the database engine are pretty optimal. Rather than code in assembler, and note that a good optimizing compiler should embody all the tricks of an expert assembly language programmer anyway, a database engine developer should focus on leveraging the CPU. This means using data structures which can be laid out and walked in memory sequentially where possibly in order to help the pre-fetcher, leveraging SIMD, knowing you way around the CPU KPIs and low level profiles such as VTune, knowing that when two threads write to the same cache line (entry) in the CPU cache that this bounces the level 2 cache etc etc. The bigger problem is that there are core things that a relational database engine does which are not CPU friendly, hashing, sorts and the index seeks associated with the nested loop join all destroy CPU cache efficiency. Regarding instruction set compatibility due to a cross licensing agreement between AMD and Intel whereby Intel license x64 from AMD and AMD license the x86 instruction set of Intel, this should be a non issue. To answer the question regarding magic levers must best advise is that this requires leg work on the grounds that no two workloads are identical, also the SQL community in general likes to dive into espousing performance tuning techniques without presenting much in the way of test evidence, i.e. this is my test hardware, this is how I configured SQL Server, this is how I generated my test data, this is how I ran my test to obtain a baseline, I hypothesize X, lets change Y and re-run the test in order to see if my theory holds true. A lot of material talks about straight OLTP or OLAP, I don’t thing this clear segregation exists in the real world as I suspect there is a hell of a lot of operational BI out there. As to what you can do, there is quite a bit which will be subject of the second new deck I will create for presentation on “The circuit”.

      • Thanks for that very considered response, Chris. I agree wholeheartedly with you about testing and knowing your experimental conditions, although I must admit that most of us tend to be content with generic “best practices” – hopefully acceptable habits that see you through the day except when you live near the edge.

        Just to clarify my comment about different CPUs, I wasn’t referring to AMD vs Intel but to the historic situation where there were other serious contenders (SPARC, Alpha, Itanium…). My point being that the 90s, where NT and SQL Server are rooted, involved more reticence about betting the farm on a specific architecture/instruction set and there was a good business case for abstracting details away as much as reasonably possible. Nowadays we still have the situation, but really only with mobile devices that won’t be running enterprise DBMSs, so in that sense we’re closer to a CPU monoculture again.

        I agree that a good optimising compiler should do wonders, and while I don’t have the expertise or time to assess this properly for myself, my perception is that real-world compilers tend to lag behind the cutting edge of what is possible, e.g. not favouring branchless logic or the use of SIMD. Now I’m not saying there is any NEED for this, but in the process of removing other bottlenecks it might come to light again one day. Unlikely I suppose except for tight code loops which can do a lot of useful work without memory fetches.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s