This post continues my work on the LMax disruptor pattern. I’ve covered the pattern itself already; what I have not covered is:
- Spinlock profiling
- Wait statistic profiling
- How the in-memory engine now behaves with SQL Server 2016 CU2 in which the database engine team have optimised the metadata cache spinlock
What Is The LMax Disruptor Pattern ?
This is a highly scalable means of allowing threads to converse via a ring buffer. You will invariably find that other people have come up with solutions and/or strategies for a particular problem you might face. In this case LMAX (the London Multi-Asset Exchange), an electronic forex trading company, developed a high performance queuing pattern known as the “LMax disruptor pattern”. My good friend Thomas Kejser blogged about this, however his post pre-dated the in-memory engine, which I have used to investigate the pattern here. To boil the pattern down to its most fundamental premise, a ring buffer is used to queue data passed between threads.
When Would I Want To Use This ?
If you have any type of scenario which needs to ingest data at a high rate and consume it in the order it was inserted, in other words first in first out (FIFO) queuing combined with the shock absorber pattern, the LMax disruptor is a good solution.
Why Not Use A Clustered Index As A Queue ?
There are two good reasons for preferring a ring buffer over a clustered index. Firstly, you want a construct that acts as a shock absorber and behaves like a first in first out queue; the only way a clustered index approach will ever scale with the legacy engine is to have one clustered index per message handling thread. Secondly, if the queue is based on a single clustered index whose cluster key is a monotonically increasing value, you get zero scalability due to what is known as the “last page problem”:
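To make the “last page problem” concrete, here is a minimal sketch (the table and column names are hypothetical, not from the tests in this post) of the kind of queue that suffers from it:

-- Hypothetical queue table: because the cluster key is monotonically
-- increasing, every INSERT appends to the end of the clustered index,
-- so all concurrent writers contend for the PAGELATCH on the same
-- (last) page of the index.
CREATE TABLE dbo.MessageQueue (
    MessageId  bigint IDENTITY (1, 1) NOT NULL,  -- monotonically increasing cluster key
    [message]  varchar(256)           NOT NULL,
    CONSTRAINT PK_MessageQueue PRIMARY KEY CLUSTERED (MessageId)
);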
If you randomise the key, say by using a GUID or rolling your own reverse key index, concurrent inserts into a clustered index will scale; however, neither of these approaches allows the inserted values to be retrieved in first in first out order.
How Does This Work ?
We have a fixed size queue with a number of slots. Each slot has a slot number which uniquely identifies it, a payload which takes the message, and a reference count to indicate whether the slot is in use. To ‘Push’ a message we locate an empty slot and update it to contain both the message and a reference count of one. Note that when implemented using SQL Server, or any other relational database for that matter, the only DML operation that ever takes place is an UPDATE.
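A minimal sketch of a slot as a table row (the column types here are assumptions inferred from the code later in the post):

-- Fixed size ring buffer: one row per slot, all rows created up front.
--   Slot            uniquely identifies the slot
--   message         carries the payload
--   reference_count 0 = slot free, 1 = slot in use
CREATE TABLE dbo.MyQLmax (
    Slot            bigint       NOT NULL PRIMARY KEY,
    message_id      bigint       NOT NULL,
    [time]          datetime     NOT NULL,
    [message]       varchar(256) NOT NULL,
    reference_count int          NOT NULL
);

-- A 'Push' is always an UPDATE of an empty slot, never an INSERT:
-- UPDATE dbo.MyQLmax
-- SET    [message] = @Message, reference_count = 1
-- WHERE  Slot = @Slot AND reference_count = 0;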
I Receive Transactions 24×7, 365 Days A Year, So How Can The Shock Absorber Pattern Apply To Me ?
“If our customers are active around the clock and the shock absorber pattern requires some lull in activity, say for example we are an on-line gaming company, how is this pattern applicable to us ?” you might ask. The answer involves partitioning, or sharding, your databases by timezone, so that you can guarantee there will be some time when each database is quiet, typically the small hours of the morning.
LMax and The Disk Store Engine: 1st Attempt
This is the code to initialise the queue by creating slots in it:
CREATE PROCEDURE [dbo].[usp_LMaxDiskQSlotInit]
AS
BEGIN
    DECLARE @QueueSize int    = 4000000
           ,@Slot      bigint
           ,@i         int    = 0;

    SET NOCOUNT ON;

    TRUNCATE TABLE [dbo].[MyQLmax];
    TRUNCATE TABLE [dbo].[MyQLmaxNode0];
    TRUNCATE TABLE [dbo].[MyQLmaxNode1];

    WHILE @i < @QueueSize
    BEGIN
        -- Reverse-key the slot number on its last digit so that inserts
        -- are spread across the index rather than piling up on the last page.
        SET @Slot = @i;
        SET @Slot = (CASE RIGHT(@Slot, 1)
                         WHEN 1 THEN @Slot + 5000000000
                         WHEN 2 THEN @Slot + 1000000000
                         WHEN 3 THEN @Slot + 8000000000
                         WHEN 4 THEN @Slot + 2000000000
                         WHEN 5 THEN @Slot + 7000000000
                         WHEN 6 THEN @Slot + 3000000000
                         WHEN 7 THEN @Slot + 6000000000
                         WHEN 8 THEN @Slot + 4000000000
                         WHEN 9 THEN @Slot + 9000000000
                         ELSE @Slot
                     END);

        INSERT INTO [dbo].[MyQLmax] ([Slot], [message_id], [time], [message], [reference_count])
        VALUES (@Slot, 0, GETDATE(), '', 0);

        SET @i += 1;
    END;
END;
GO

CREATE PROCEDURE [dbo].[usp_LmaxPushDiskSequence]
    @TransactionsPerThread int = 200000
AS
BEGIN
    DECLARE @MessagePushed int
           ,@i             int = 0;

    SET NOCOUNT ON;

    WHILE @i <= @TransactionsPerThread
    BEGIN
        EXEC dbo.usp_PushMessageDiskSequence @MessagePushed OUTPUT;
        SET @i += 1;
    END;
END;
GO

CREATE PROCEDURE [dbo].[usp_PushMessageDiskSequence]
    @MessagePushed int OUTPUT
   ,@QueueSize     int = 4000000
AS
BEGIN
    DECLARE @Slot bigint;

    SET NOCOUNT ON;
    SET @MessagePushed = 0;

    WHILE @MessagePushed = 0
    BEGIN
        -- Map the next sequence value onto a slot, then apply the same
        -- reverse-key transformation used when the queue was initialised.
        SELECT @Slot = NEXT VALUE FOR [dbo].[PushSequence] % @QueueSize;
        SET @Slot = (CASE RIGHT(@Slot, 1)
                         WHEN 1 THEN @Slot + 5000000000
                         WHEN 2 THEN @Slot + 1000000000
                         WHEN 3 THEN @Slot + 8000000000
                         WHEN 4 THEN @Slot + 2000000000
                         WHEN 5 THEN @Slot + 7000000000
                         WHEN 6 THEN @Slot + 3000000000
                         WHEN 7 THEN @Slot + 6000000000
                         WHEN 8 THEN @Slot + 4000000000
                         WHEN 9 THEN @Slot + 9000000000
                         ELSE @Slot
                     END);

        -- A push is purely an UPDATE of an empty slot.
        UPDATE [dbo].[MyQLmax]
        SET    time            = GETDATE()
              ,message         = 'Hello world'
              ,message_id      = @Slot
              ,reference_count = reference_count + 1
        WHERE  Slot = @Slot
        AND    reference_count = 0
        OPTION (MAXDOP 1);

        SET @MessagePushed = @@ROWCOUNT;
    END;
END;
This is the test environment which will be used throughout this blog post:
A workload which purely involves the pushing of messages results in the following throughput graphs which represent the same test performed with and without delayed durability:
Pushing messages onto the queue and popping them off simultaneously results in the following graphs:
Digging into wait activity for the test that uses delayed durability and only the pushing of messages reveals the following:
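For anyone wishing to reproduce this, the aggregate wait activity can be obtained with a query along these lines (a sketch using the standard DMV, not the exact query used for the graphs):

-- Top waits by total wait time; clear the stats between test runs with:
-- DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);
SELECT TOP (10)
       wait_type
      ,waiting_tasks_count
      ,wait_time_ms
      ,signal_wait_time_ms
FROM   sys.dm_os_wait_stats
ORDER  BY wait_time_ms DESC;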
Drilling down into the latch stats, the METADATA_SEQUENCE_GENERATOR is the culprit behind the LATCH_SH waits:
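The latch breakdown comes from sys.dm_os_latch_stats; something like the following surfaces the METADATA_SEQUENCE_GENERATOR class (again a sketch, not the exact query behind the screenshot):

-- Top latch classes by total wait time; clear between runs with:
-- DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);
SELECT TOP (10)
       latch_class
      ,waiting_requests_count
      ,wait_time_ms
FROM   sys.dm_os_latch_stats
ORDER  BY wait_time_ms DESC;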
This is how the METADATA_SEQUENCE_GENERATOR latch wait appears in the call stack. I’ve used a technique with the Windows Performance Toolkit known as wait analysis. Some of the terminology is not particularly intuitive; the ‘New’ thread stack is the stack of the thread we are switching away from, and the ‘Ready’ thread stack is that of the thread that has ‘readied’ the new thread by satisfying the event it was waiting on.
The In-Memory Engine
This is an initial attempt at looking at how the pushing of messages scales when using the in-memory engine; as we can see, we do not even get the same throughput achieved with the legacy disk engine:
This graph suggests that the extra pressure being placed on the METADATA_SEQUENCE_GENERATOR latch is responsible for this:
Using the disk row store engine resulted in a maximum wait time of just under 600,000 ms for 24 concurrent message push threads; the in-memory engine nearly doubles this for the same number of threads. It is probably about time we looked at removing the bottleneck caused by the sequence object...
Crafting A Lock Free Latch Free Sequence
If we take a skinny memory optimised table with an identity column, a default value can be inserted into the table and the value of the IDENTITY column obtained via SCOPE_IDENTITY(), as illustrated in the sample code below:
CREATE TABLE [dbo].[NonBlockingPushSequence] (
    [ID] BIGINT IDENTITY (1, 1) NOT NULL,
    PRIMARY KEY NONCLUSTERED HASH ([ID]) WITH (BUCKET_COUNT = 524288)
) WITH (DURABILITY = SCHEMA_ONLY, MEMORY_OPTIMIZED = ON);
GO

CREATE PROCEDURE [dbo].[usp_PushMessageImOltpNoSequence]
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    DECLARE @MessagePushed int
           ,@QueueSize     int = 4000000
           ,@Slot          bigint;

    SET @MessagePushed = 0;

    WHILE @MessagePushed = 0
    BEGIN
        -- Generate the next value of the lock and latch free sequence:
        -- insert a default row and read the identity value back.
        INSERT INTO [dbo].[NonBlockingPushSequence] DEFAULT VALUES;
        SELECT @Slot = SCOPE_IDENTITY();
        DELETE FROM [dbo].[NonBlockingPushSequence] WHERE ID = @Slot;

        UPDATE [dbo].[MyQLmaxImOltp]
        SET    time            = GETDATE()
              ,message         = 'Hello world'
              ,message_id      = @Slot
              ,reference_count = reference_count + 1
        WHERE  Slot = @Slot % @QueueSize
        AND    reference_count = 0;

        SET @MessagePushed = @@ROWCOUNT;
    END;
END;
This is what message throughput looks like when only messages are pushed onto the queue:
The one key takeaway point from this graph is the fact that the “Lock and latch free sequence” has emphatically done its job in improving throughput.
O(1) Versus O(N) Performance
“Big O” notation is used to describe the time complexity of an algorithm, when there are few rows per bucket, in-memory inserts follow O(1) behaviour. Simply put, the time to perform the inserts is constant irrespective of the input data set size. As the number of rows per bucket increases, so does the number of linear scans of the bucket linked list to do the same job. The time complexity of the insert becomes O(N); or in other words, elapsed time grows in a linear manner that is proportional to the size of the input data set.
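Whether a hash index has drifted from O(1) towards O(N) behaviour can be gauged from its bucket chain lengths; a sketch using the standard in-memory OLTP DMV:

-- Long average/maximum chain lengths relative to a BUCKET_COUNT sized
-- for the expected row count indicate O(N) scans of the bucket lists.
SELECT OBJECT_NAME(hs.object_id) AS table_name
      ,i.name                    AS index_name
      ,hs.total_bucket_count
      ,hs.empty_bucket_count
      ,hs.avg_chain_length
      ,hs.max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats AS hs
JOIN   sys.indexes                    AS i
ON     hs.object_id = i.object_id
AND    hs.index_id  = i.index_id;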
“Noisiness” Of The Graph
The second takeaway point from the graphs is that although they follow a general upward trend in throughput as the number of threads increases, the graphs are very jagged, a fact that cannot be attributed to hyper-threading because this is turned off. My next port of call when I see this type of behaviour is to alter the scheduling algorithm using trace flag 8008; however, this made no difference. At the time of writing I am at a loss as to what is behind this; one school of thought proposed to me is that the scheduling algorithm is being overloaded.
This behaviour cannot be explained away by wait or spin-lock stats; note that for a spin-lock issue to be worth investigating you need to see spins of the order of billions. I have also repeated the same tests with SQL Server 2016 SP1 CU3 and this made no difference either.
The Overhead Of Logging
So far delayed durability has been used with DURABILITY set to SCHEMA_AND_DATA on the queue table. This raises the question: what happens when DURABILITY is set to SCHEMA_ONLY? This is what happens:
The graphs illustrate that doing away with durability altogether does not translate into any improvement in either scalability or performance.
Large Page Support
The next “tuning knob” to try was to enable large page support. For clarity, this means pages in the context of how the database engine memory management framework manages memory, not the 8K pages which make up the buffer pool.
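As an aside for anyone wanting to try this: my understanding is that large pages require the ‘Lock Pages in Memory’ privilege for the service account plus trace flag 834 as a startup parameter, and whether they are actually in use can be checked along these lines:

-- large_page_allocations_kb is non-zero when the engine has been able
-- to allocate memory using large pages.
SELECT locked_page_allocations_kb
      ,large_page_allocations_kb
FROM   sys.dm_os_process_memory;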
I’ve highlighted the part of the graph between 1 and 4 threads; this is where the test using large pages outperforms its counterpart which does not use them. Large page support enhances the performance of the database engine by allowing the CPU page table to store larger virtual to physical memory mappings. My theory as to why the test that uses large pages only outperforms the other up to four threads is that thrashing of the memory manager page table takes place from five threads onwards.
The Performance Final Frontier
Once all conventional bottlenecks have been eliminated, namely locking, latching, spin-locking and anything associated with durability, do we have any bottlenecks left to tackle ? To answer this, let’s first look at wait activity:
Most of the wait activity is attributable to SOS_SCHEDULER_YIELD; in isolation this tells us little other than that the test causes high and sustained CPU utilisation. Looking at spin-lock activity, LOCK_RW_CMED_HASH_SET is the dominant spin-lock in terms of spin activity; it was introduced in SQL Server 2016 CU2 as the solution to CMED_HASH_SET (the metadata cache spinlock) consuming excessive CPU cycles:
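The spin-lock numbers come from sys.dm_os_spinlock_stats; a sketch of the kind of query used to rank them:

-- Remember the rule of thumb from earlier: spins of the order of
-- billions before a spin-lock is worth investigating.
SELECT TOP (10)
       name
      ,collisions
      ,spins
      ,spins_per_collision
      ,backoffs
FROM   sys.dm_os_spinlock_stats
ORDER  BY spins DESC;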
Where Are Our CPU Cycles Going ?
Using the Windows Performance Toolkit we can obtain a 360 degree view of where the (sampled) CPU cycles are being expended by the database engine:
As we might expect, most of the CPU is being expended in hkengine.dll. Drilling down into the functions it calls reveals nothing that sticks out as consuming an inordinate amount of CPU.
Let’s now take a look at what sqllang.dll is doing:
Considering that all the legacy engine is doing is executing a natively compiled stored procedure in a loop, sqllang.dll is burning up quite a bit of CPU; note the call to HKProc::FExecuteInternal<0>. Locating this function in the call stack suggests it is associated with switching between the context of the legacy engine and the in-memory engine:
This Event Tracing for Windows (ETW) information was obtained using the very latest cumulative update for service pack 1 of SQL Server 2016. One of the nuances of natively compiled stored procedures is that if a developer wants to build transaction retry logic into their code, this has to be done by invoking the natively compiled procedures from non-compiled (interpreted) stored procedures. Clearly there is an overhead when jumping between the legacy engine and the in-memory engine.
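A sketch of what such an interpreted wrapper looks like (the wrapper name and retry policy are my own assumptions; the inner procedure is the one defined earlier in this post):

-- Interpreted wrapper providing retry logic for a natively compiled
-- procedure: each retry re-enters the in-memory engine, which is
-- exactly the engine-switching overhead visible in the call stacks.
CREATE PROCEDURE dbo.usp_PushMessageWithRetry
AS
BEGIN
    DECLARE @Retries int = 0;

    WHILE @Retries < 10
    BEGIN
        BEGIN TRY
            EXEC dbo.usp_PushMessageImOltpNoSequence;
            RETURN;
        END TRY
        BEGIN CATCH
            -- 41302, 41305, 41325, 41301: in-memory OLTP write conflict,
            -- validation and dependency errors that are safe to retry.
            IF ERROR_NUMBER() NOT IN (41302, 41305, 41325, 41301)
                THROW;
            SET @Retries += 1;
        END CATCH;
    END;
END;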
Takeaway Points
- WRITELOG waits
Readers of my posts will know that I use delayed durability a lot. Whilst there are many real world production scenarios in which full durability is a hard and fast requirement, in order to get past the WRITELOG barrier and explore the other pressure points in the engine, the use of delayed durability is a must.
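For completeness, delayed durability is enabled along these lines:

-- Force delayed durability for every transaction in the database ...
ALTER DATABASE CURRENT SET DELAYED_DURABILITY = FORCED;

-- ... or, with DELAYED_DURABILITY = ALLOWED, opt in per transaction:
BEGIN TRANSACTION;
    -- (DML here)
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);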
- Lock free and latch free sequence generation
The fastest and most scalable way of generating sequences with SQL Server is to insert a row into a DURABILITY = SCHEMA_ONLY memory optimised table with a single identity column and then use SCOPE_IDENTITY().
- Scalability Without Durability
Forgoing all durability on the memory optimised queue table via DURABILITY = SCHEMA_ONLY resulted in no increase in performance or scalability compared to the exact same test using DURABILITY = SCHEMA_AND_DATA.
- In-Memory Hash Index Scalability With Large Page Support
Testing revealed that there is a point past which the translation lookaside buffer becomes saturated, to the extent that large memory pages only provided an increase in performance and scalability up to four threads.
- Erratic Throughput Scalability Graphs For The In-Memory Engine
The message throughput graphs for the in-memory engine are highly erratic; at the time of writing I have no explanation for why this is.
- The Overhead Of Switching Between Engines
There is a minor overhead in switching between the legacy engine and the in-memory engine.
- Ingesting Data Into The Database Engine To Be Processed In FIFO Order
The fastest way to ingest data into the database for processing in FIFO order is to implement the LMax disruptor pattern.