Super Scaling Queues Using the LMax Disruptor Pattern And The In-Memory OLTP Engine

The Story So Far

The graph below represents the throughput we managed to get ( from a warm buffer cache ) from the legacy database engine and the help of an in memory table as a scale-able means of generating sequences:

legacy engine lmax

Takeaways From Using The Legacy Engine

In the order of bottlenecks encountered, these were:

  • WRITELOG waits
    Re-mediated by switching to delayed durability.
  • PAGELATCH_SH waits
    Re-mediated by switching to a simple piece of code which took the least significant digit of the sequence and then adding a 1000,000,000 to it so as to ensure that slots in the queue were logically contiguous but not physically contiguous. Reference was made to a more elegant and scale-able reverse key index solution.
  • METADATA_SEQUENCE_GENERATOR – LATCH_SH waitsWe then hit significant wait times on the latch used to protect the push sequence, the solution to this problem was to use a memory optimized table with a single bigint column that had an identity associated with it. Sequence values were obtained by inserting into this and getting the latest identity value using SCOPE_IDENTITY().

Disclaimer

The material containing in this post represents a home experiment as such the reader should heed the following points:

  • The test server is a “Home build”, built from the best quality components I could afford and get my hands on, but a home build nonetheless.
  • The version of SQL Server I am using is a CTP, unless you have a business case which desperately needs the functionality in a CTP, CTP’s are generally avoided for production purposes.
  • In order to avoid what would otherwise be a very short post along the lines of . . .I hit the WRITELOG wall – the end, I use delayed durability, again some if not most production environments will not tolerate the risk of data loss. If anyone has a genuine requirement to implement a high performance queue that is 100% durable, I would recommend the use of a PCIe flash card that support a NVMe compliant driver, which is what I will be buying at some point this year.
  • The code excerpts in this blog have been written to demonstrate how certain parts of the database engine behave under pressure and not for production purposes.
  • Anyone wishing to reuse the code or concepts in this blog will not obtain exact same results as I have unless the test environment and test data used is identical to that used in this post.

Problem Statement

To push 2000,000 messages into a queue using the LMax disruptor pattern in the most scale-able way possible, the primary intention of this blog post is not to produce code fit for production, this will be addressed in a subsequent blog post.

What Limits The Scale-ability Of The Legacy Engine ?

Before logging records have been made durable you are likely to hit heavy spin related CPU usage on the LOGCACHE_ACCESS spinlock:

logcache_access

The second pressure point is the XDESMGR spinlock, this serializes access to the the part of the engine that issues and destroy transaction ids, that the in-memory OLTP engine does not use this spinlock if natively compiled procedures are used.

A third spinlock comes into play when in memory tables are subject to DML statements issued from non natively compiled stored procedures, this is the BLOCKER_ENUM spinlock and performs serialization for the mechanism that allows transactions to span both database engines.

How Does The In Memory OLTP Engine  Fly ?

This is out test environment:

  • Server with 2 x 10 core 2.2 Ghz Xeon V3 ( Haswell ) processors with hyper threading enabled and 64GB of DDR4 memory.
  • Windows Sevrer 2012 R2.
  • SQL Server 2016 CTP 2.4.
  • Trace flags 8008 and 8048 enabled.
  • delayed durability set to forced

and here is version 1.0 of the code that will be tested, note that due to a wordpress issue that i am yet to get to the bottom of, LTE is used in place of <= throughout this post:

CREATE PROCEDURE [dbo].[LmaxPushImOltp]
AS
BEGIN
    DECLARE  @QueueSize     int = 2000000
            ,@MessagePushed int
            ,@Slot          int
            ,@i             int = 0;

    WHILE @i LTE @QueueSize
    BEGIN
        BEGIN TRAN
            EXEC GetSlotId @Slot OUTPUT, @QueueSize;
        ROLLBACK TRAN;

        EXEC dbo.PushMessageImOltp  @Slot
                                   ,@MessagePushed OUTPUT;

        IF @MessagePushed = 0
        BEGIN
            SET @i += @QueueSize;
        END
        ELSE
        BEGIN
           SET @i += 1;
        END;
    END;
END;

CREATE PROCEDURE GetSlotId  @Slot int OUTPUT
                           ,@QueueSize int
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC
    WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
          ,LANGUAGE = N'us_english')
    INSERT INTO dbo.NonBlockingSequence
        DEFAULT VALUES;

    SELECT @Slot = SCOPE_IDENTITY() % @QueueSize;
END;

CREATE PROCEDURE PushMessageImOltp  @Slot int
                                   ,@MessagePushed int OUTPUT
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC
    WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
          ,LANGUAGE                    = N'us_english')

    DECLARE @QueueSize int = 2000000;

    UPDATE dbo.MyQLmaxImOltp
    SET    time            = GETDATE()
          ,message         = 'Hello World'
          ,message_id      = @Slot
          ,reference_count = reference_count + 1
    WHERE Slot = @Slot
    AND reference_count = 0;

    SET @MessagePushed = @@ROWCOUNT;
END;

Here are the T-SQL statements for the two tables used by the code:

CREATE TABLE MyQLmaxImOltp
(
     Slot            bigint       NOT NULL
    ,message_id      bigint       NULL
    ,time            datetime     NOT NULL
    ,message         char(300)
     COLLATE Latin1_General_CI_AS NOT NULL
    ,reference_count tinyint      NOT NULL,

     PRIMARY KEY NONCLUSTERED HASH (
         Slot
     ) WITH ( BUCKET_COUNT = 2097152)
) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )

CREATE TABLE dbo.NonBlockingSequence
(
     ID bigint IDENTITY(1,1) NOT NULL
    ,PRIMARY KEY NONCLUSTERED HASH ( ID ) WITH ( BUCKET_COUNT = 524288)
) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )

An initial test of the code results in this throughput graph, as before ostress issused to create multiple threads:

graph 1

 

Takeaway #1
Version 1 of the code does not scale that well beyond half the logical processors available to SQL Server

The actual throughput falls away from the theoretical ‘Perfect’ scale-ability curve at 23 threads and this is what the wait time looks like for this test:

waits

At 23 threads our workload is 100% service time bound, this should come as no great surprise as the in-memory OLTP engine eliminates lock and latch contention, therefore … Where Is Our CPU Time Going ?:

oltp v1 modules

What should stick out like a sore thumb is the amount of CPU time being burned by the sqllang.dll compared to the in-memory OLTP engine. Granted the in-memory engine is very efficient and the code used is natively compiled. However, considering the little that the legacy engine is being tasked with, this does seem excessive. Lets drill down into the modules to see what functions they are calling:

oltp v1 functions

oltp v1 functions

My initial thought was that there is certain overhead in calling natively compiled code from the legacy engine, hence the lines highlighted in blue above. What happens if the slot id is obtained inside the same natively compiled stored procedure as the one used to push messages onto the queue ?. This will require some code modifications, which should not be too onerous, firstly to the main stored procedure that drives the whole test:

ALTER PROCEDURE LmaxPushImOltp
AS
BEGIN
    DECLARE  @QueueSize     int = 2000000
            ,@MessagePushed int
            ,@Slot          int
            ,@i             int = 0;

    WHILE @i LTE @QueueSize
    BEGIN
        EXEC dbo.PushMessageImOltp  @Slot
                                   ,@MessagePushed OUTPUT;

        IF @MessagePushed = 0
        BEGIN
            SET @i += @QueueSize;
        END
        ELSE
        BEGIN
            SET @i += 1;
        END;
    END;
END;

and then to the message push procedure:

ALTER PROCEDURE PushMessageImOltp  @Slot int
                                  ,@MessagePushed int OUTPUT
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC
WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
      ,LANGUAGE                    = N'us_english')
    DECLARE @QueueSize int = 2000000;

    INSERT INTO dbo.NonBlockingSequence
        DEFAULT VALUES;

    SELECT @Slot = SCOPE_IDENTITY() % @QueueSize;

    UPDATE dbo.MyQLmaxImOltp
    SET     time            = GETDATE()
           ,message         = 'Hello World'
           ,message_id      = @Slot
           ,reference_count = reference_count + 1
    WHERE Slot = @Slot
    AND = 0;

    SET @MessagePushed = @@ROWCOUNT;
END;

How well does this new code scale ?:

graph 2

Takeaway #2
Reducing the amount of non natively compiled code used nearly double throughput.

Lets compare where the CPU time is going to for 15 threads with versions 1.0 and 2.0 of the code, firstly at module (.dll) level, for the first version of the code followed by the second version:

chris

chris

This reveals a significant drop in the CPU time used by:

  • The SQL Server instance overall
  • hkengine.dll
  • sqllang.dll

Lets drill down in hkengine.dll, as before with the initial version of the code and then the second version:

hkengine v1

hkengine v2

Of the top CPU consuming functions HkScopeEnter, which could be a variety of things, transaction scope springs to mind, has undergone the largest reduction in CPU time. However the functions ordered in descending CPU time (weight) reveal no single or even handful of functions that are responsible for the overall drop in CPU time, we appear to have achieved an increase in efficiency across the board. Will drilling down into sqllang.dll reveal anything of interest ?:

sqllang v1

sqllang v2

Again there is no one single function call that stands out as burning significantly less CPU time following the code change. The lines I’ve highlighted are for the functions I believe are the ones sqllang.dll invokes in order for natively compiled procedures to be called.

Reducing The Overhead of sqllang.dll Completely

Lets change the code again so as to make all the code used in the test natively compiled:

CREATE PROCEDURE [dbo].[LmaxPushImOltp]
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC
    WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
          ,LANGUAGE                    = N'us_english')

    DECLARE  @QueueSize     int = 2000000
            ,@MessagePushed int
            ,@Slot          int
            ,@i             int = 0;

    WHILE @i LTE @QueueSize
    BEGIN
        EXEC dbo.PushMessageImOltp  @Slot
                                   ,@MessagePushed OUTPUT;

        IF @MessagePushed = 0
        BEGIN
            SET @i += @QueueSize;
        END
        ELSE
        BEGIN
            SET @i += 1;
        END;
    END;
END;

An initial test of this code with a single thread results in a message queue push throughput of 200,000 messages per second, things are looking up . . . until the test is performed with 2 threads:

BANG

Unfortunately, because a natively compiled procedure has to have an atomic block by design, i.e. every DML statement in the procedure will either succeed or fail, if a natively compile procedure bombs out, the transaction cannot be retried within the procedure. The retry logic could be coded into a conventional stored procedure, however this would defeat the objective of this exercise in the first which is to do away with non-natively code.

Where are The In Memory OLTP Pressure Point Beyond 40 Logical Processors ?

At the time of writing this post, my test  hardware is constrained to 20 physical cores ( 10 per socket ). However, as the number of threads used in the test approaches 40 ( two hyper-threads per physical core), we can theorize on where the pressure points might be.

Lets try a test with 40 threads and version 2.0 of the code, we get 100% CPU utilization:

cpu usage

and this wait activity:

waits

but more interestingly, we are getting a considerable number of spins on the LOGCACHE_ACCESS spinlock:

spins

Takeaway #3
Despite the fact the in-memory OLTP engine does away with write ahead logging, its still possible for the LOGCACHE_ACCESS spinlock to become a bottleneck.

Lets repeat the test with the CPU affinity mask trick, 38 threads (not 40):

waits

LOGCACHE_ACCESS activity have now dropped significantly:

spins

what does the call stack have to say ?:

modules

Why when we are using the in memory OLTP engine is so much time being expended in the legacy engine run time ( sqlmin.dll ) ?, drilling down into sqlmin.dll to function level might provides some clues:

modules and funcs

Spinlock<62,16,1>::SpinToAcquireWithExponentialBackoff looks like it may well be our CMED_HASH_SET spinlock, this is where it appears in the call stack:

metadata spinlock

What Does CMED_HASH_SET Protect ?

Before the database engine executes a query it has to check a metadata cache in order to determine whether or not the objects it intends to use have been dropped or not. The only way to circumvent this would be to place our objects in the master database because the database engine is hard coded to assume that objects in master will never dropped. I am not suggesting for a second that anyone should try this, I am merely stating the way in which the database engine runs and thinks.

Takeaway #4
The ultimate bottleneck for version 2.0 of the code is the spinlock used for synchronizing access to the metadata cache.

To reuse a familiar phrase, there is no so such thing as a free lunch, the fact that the in-memory OLTP engine is latch and lock free should not obviate the fact that it is emphatically not spinlock free.

Wrap Up

This is the tongue in cheek “About me” – intro slide for the latest deck I have been presenting on the SQL community circuit:

scotty

Readers of a similar vintage as myself will be familiar with Scotty’s famous line:

Ye cannae change the laws of physics Jim

The laws of physics we are running head long into govern the speed with which the memory cache lines used to store spinlocks can be passed from CPU core to CPU core. As a recap this is how spinlocks are released and acquired:

spinlock_cacheline

The best case scenario is that the two threads involved in the spinlock-release-acquire process are running as hyper threads on the same physical CPU core:

Latency = CPU L1 cache latency

and the worst case is that the two CPU core reside on different sockets:

Latency = quick path interconnect latency

In practical terms the best we can do at CPU level is:

  • Go for processors with the fastest single threaded performance available.
  • Investigate Bios and power management options for making Intel’s clock speed enhancing turbo boost feature kick in.
  • Isolate the workload to a single socket so that the maximum CPU core to CPU core latency is that of the L3 cache and not the quick path interconnect.
  • Affinitize the rest of the database engine away from the core the log writer is running on, in the case of the LOGCACHE_ACCESS spinlock.

Yet again, we have come full circle in alluding to the one image which is so critical when it comes to making software perform and scale on a modern CPU:

The Good The Bad and The Ugly

 

 

 

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s