Tuning The XDESMGR Spinlock On A “Big Box”

The last blog post on spinlocks focused on the LOGCACHE_ACCESS spinlock and how pressure on it could be reduced by tweaking the CPU affinity mask. However, as is invariably the case with tuning, removing one bottleneck only causes another to appear elsewhere, much like whack-a-mole.


To recap, this is the T-SQL used to create the test table, seed it with data, and define the stored procedure used for inserting into it. I’ve also included statements for clearing down the transaction log and the performance-related dynamic management views:

CREATE TABLE [dbo].[MyBigTable] (
     [c1] [bigint]    NOT NULL
    ,[c2] [datetime]  NULL
    ,[c3] [char](111) NULL
    ,[c4] [int]       NULL
    ,[c5] [int]       NULL
    ,[c6] [bigint]    NULL 
    CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ( 
        [c1] ASC ) 
    WITH ( PAD_INDEX              = OFF 
          ,STATISTICS_NORECOMPUTE = OFF
          ,IGNORE_DUP_KEY         = OFF
          ,ALLOW_ROW_LOCKS        = ON 
          ,ALLOW_PAGE_LOCKS       = ON) ON [FG_01] ) 
ON [FG_01] 
GO 
TRUNCATE TABLE dbo.MyBigTable; 

;WITH t4   AS (SELECT     n 
               FROM       (VALUES(0),(0),(0),(0)) t(n)) 
     ,t256 AS (SELECT     0  AS n 
               FROM       t4 AS a 
               CROSS JOIN t4 AS b 
               CROSS JOIN t4 AS c 
               CROSS JOIN t4 AS d)
     ,t16M AS (SELECT     ROW_NUMBER() 
                              OVER (ORDER BY (a.n)) AS num 
               FROM       t256 AS a CROSS JOIN t256 AS b 
               CROSS JOIN t256 AS c) 
INSERT INTO dbo.MyBigTable WITH (TABLOCKX) 
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT) 
       ,GETDATE() 
       ,'a'
       ,1
       ,2
       ,42
FROM   t16M; 

BACKUP LOG SingletonInsert TO DISK = 'NUL:' 
DBCC SQLPERF("sys.dm_os_wait_stats"    , clear) 
DBCC SQLPERF("sys.dm_os_spinlock_stats", clear) 
DBCC SQLPERF("sys.dm_os_latch_stats"   , clear) 
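
After each test run, the cleared DMVs can be queried to see which spinlocks burned the most CPU; a minimal sketch against sys.dm_os_spinlock_stats:

```sql
-- Top spin consumers since the last DBCC SQLPERF(..., CLEAR)
SELECT TOP (3)
        name
       ,collisions
       ,spins
       ,spins_per_collision
       ,backoffs
FROM    sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```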

CREATE PROCEDURE [dbo].[usp_Insert] 
     @BatchSize    int = 1
    ,@RowsToInsert int = 2000000 
AS 
BEGIN 
    SET NOCOUNT ON; 

    DECLARE  @i    INTEGER = 0
            ,@j    INTEGER
            ,@base BIGINT  = @@SPID * 10000000000; 

    WHILE @i < @RowsToInsert 
    BEGIN 
        SET @j = 0;
        BEGIN TRANSACTION 
            -- Batch @BatchSize inserts per transaction so that
            -- one transaction id covers @BatchSize rows
            WHILE @j < @BatchSize AND @i < @RowsToInsert
            BEGIN
                INSERT INTO dbo.MyBigTable (c1) 
                     VALUES (@base + @i); 
                SET @i += 1;
                SET @j += 1;
            END;
        COMMIT; 
    END; 
END;

This is the test environment:

Test Environment

Trace flags 1117 and 8008 are enabled, and the filegroup containing the table used for the test has one data file per logical processor, 40 in total. ostress from the RML Utilities (downloadable from here) is used to fire off the usp_Insert stored procedure thus:
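
For reference, the trace flags can also be enabled globally from T-SQL before the run; a sketch, assuming they are not already set as -T startup parameters:

```sql
-- 1117: grow all files in a filegroup equally
-- 8008: assign new work requests to the least loaded scheduler
DBCC TRACEON (1117, -1);
DBCC TRACEON (8008, -1);
```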

ostress -E -dSingletonInsert -Q"exec usp_Insert" -n<num threads>

Relieving the pressure on the LOGCACHE_ACCESS spinlock places more pressure on XDESMGR:

xdesmgr spins comparison

XDESMGR serializes access to the part of the database engine used to create and destroy transaction ids:

xdesmgr

There are several solutions for dealing with this spinlock:

  • Use processors that provide superior single-threaded performance, so that the spinlock can be acquired and released faster.
  • Batch multiple DML statements per transaction.

Increasing the batch size by just one results in a big jump in insert performance:

xdesmgr batch size
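
To make the effect concrete, batching two singleton inserts per transaction means one transaction id now covers two rows, halving the number of XDESMGR acquisitions for the same row count. A minimal sketch of what a batch size of two amounts to, using the variables from usp_Insert:

```sql
-- One transaction id is issued and destroyed per COMMIT,
-- so two rows now cost one trip through XDESMGR instead of two
BEGIN TRANSACTION;
    INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i);
    INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i + 1);
COMMIT;
```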

There are several places to go after exhausting the low-hanging-fruit tuning options with the disk-based engine:

Tuning 101: When The Bottleneck Is Contention On A Single Resource, Create More Of Them!

By their very nature, both the LOGCACHE_ACCESS and XDESMGR spinlocks are singleton resources; however, you can ‘shard’ a database across instances and ‘affinitize’ each instance to a NUMA node / CPU socket:

shard instance

As per the bullet points above, this kills two birds with one stone: you are no longer constrained to a single LOGCACHE_ACCESS and XDESMGR spinlock, and the cache lines associated with these spinlocks only have to burn 30 or so CPU cycles travelling across the level 3 cache instead of the Quick Path Interconnect which links sockets. All cache lines inside the CPU cache are tagged with the main memory address they originate from.

cache lines

‘Affinitizing’ an instance to a CPU socket is tricky: there is no way of explicitly binding the log writer to a NUMA node, and because Windows Server 2012 R2 is used for these tests, Windows System Resource Manager is not an option (without using a hack). The one remaining option is to create two virtual machines and tie each one to its own CPU socket. The CPU affinity mask makes performing the insert test on a single socket trivial and can indicate the results that the database sharding approach might yield; these are the results this gives us:

single socket

The test achieves a peak throughput of around 475,000 inserts per second with 18 threads; with both sockets the peak throughput was 500,000 inserts per second. Because the operating system loads one of the NUMA nodes up with threads, the results from the graph above are unlikely to translate into a 100% improvement in insert throughput if one instance per socket could be used; however, even if the gain is somewhere between 50 and 100%, it is still impressive.
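
For completeness, restricting an instance’s schedulers to the logical processors of a single socket can be scripted; a sketch, with the range 0 to 9 corresponding to socket 0 being an assumption for this particular box:

```sql
-- Bind the instance's schedulers to the first ten logical processors
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 9;

-- Revert to using all available processors
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```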

Ulrich Drepper has produced some seminal white papers on memory, excerpts from these papers shed light on why spinlocks may inhibit database engine scalability on multi socket servers:

dreppers1

The CPU cache paper implies that a performance penalty is paid when passing spinlock cache lines around:

dreppers2

In Memory OLTP

The top three most spin intensive spinlocks when using the disk based database engine and a CPU core dedicated to the log writer are:

  • XDESMGR
  • LOGCACHE_ACCESS
  • LOGFLUSHQ

Let’s perform the insert test with an in-memory table:

CREATE TABLE [dbo].[MyBigTable]
(
     [c1] [bigint]   NOT NULL
    ,[c2] [datetime] NULL
    ,[c3] [char](111) COLLATE Latin1_General_CI_AS NULL
    ,[c4] [int]      NULL
    ,[c5] [int]      NULL
    ,[c6] [bigint]   NULL
    CONSTRAINT [PK_MyBigTable] PRIMARY KEY NONCLUSTERED HASH
    ( [c1] ) WITH ( BUCKET_COUNT = 16777216)
) WITH ( MEMORY_OPTIMIZED = ON
        ,DURABILITY       = SCHEMA_AND_DATA )
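
For the table above to be creatable, the database needs a memory-optimized filegroup first; a sketch of the prerequisite, with the database name and file path being assumptions for this environment:

```sql
ALTER DATABASE imoltp
    ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA;

ALTER DATABASE imoltp
    ADD FILE ( NAME     = 'imoltp_mod'
              ,FILENAME = 'C:\Data\imoltp_mod' )
    TO FILEGROUP imoltp_mod;
```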

With hyper-threading disabled, performing the inserts into an in-memory table via a regular (interpreted) stored procedure, we see a new entry in the top three spin-intensive spinlocks:

BLOCKER_ENUM

There was almost no BLOCKER_ENUM spinlock activity when using a disk-based table, but now there is. What does it do? Its name reveals little, so let’s dig into the call stack:

BLOCKER_ENUM

I suspect that BLOCKER_ENUM might be a ‘generic’ spinlock used in multiple scenarios; in this specific one it synchronizes transactions between the disk-based and in-memory OLTP engines. Running the test again, but with a natively compiled stored procedure:

CREATE PROCEDURE [dbo].[usp_insertImoltp]
      @BatchSize    int = 2
     ,@RowsToInsert int = 2000000
     ,@Spid         bigint
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
                   ,LANGUAGE = N'English')
    DECLARE  @j INTEGER   = 0
            ,@base BIGINT = @Spid * 10000000000;

    WHILE @j < @RowsToInsert
    BEGIN
        INSERT INTO dbo.MyBigTable (c1)
            VALUES (@base + @j);
        SET @j += 1;
    END;
END;

The in-memory OLTP database engine does not support @@SPID, which requires that @@SPID be passed into the stored procedure on the ostress command line:

ostress -E -dimoltp -Q"exec usp_insertImoltp @Spid=@@SPID" -n<num threads>

. . . et voilà! The BLOCKER_ENUM spinlock has dropped out of the top ten spin-intensive spinlocks, let alone the top three:

BLOCKER_ENUM

Takeaways

  • The XDESMGR spinlock serializes access to the part of the disk-based database engine that issues and destroys transaction ids.
  • Reduce pressure on the XDESMGR spinlock by:
    • Batching multiple DML statements per transaction.
    • Using CPUs with superior single-threaded performance.
    • Sharding the database containing the table being inserted into across multiple instances.
    • Using the in-memory OLTP engine, as this does not use the XDESMGR spinlock at all.
  • The mechanism that enables transactions to span both the disk-based and in-memory OLTP database engines uses the BLOCKER_ENUM spinlock.
