One of the many things I hope to get around to blogging about, time permitting, is the challenge of building “web scale” platforms using SQL Server. Among the many challenges this presents, one is coming up with a platform which can service OLTP and OLAP workloads equally well when the platform cannot tolerate the latency of traditional-style ETL processes. SQL Server 2016 introduces operational analytics as a potential solution to this, however it has two key limitations:
- The in-memory OLTP engine does not have feature parity with the legacy engine (yet).
- The batch engine will never outperform queries using the legacy engine against pre-aggregated data.
A different tack is to use some form of queue to copy the data from a normalized to a de-normalized data model. Great care has to be taken when creating this queueing mechanism in order to avoid it becoming a bottleneck in itself.
Thomas Kejser has blogged (here) on the use of the LMax disruptor pattern as a means of implementing a high performance queue on a relational database engine. In this post I want to explore this in depth, covering:
- How well an LMax disruptor based queuing mechanism scales on SQL Server.
- What the bottlenecks are for this approach.
- Where a SQL Server LMax disruptor queue burns its CPU time.
Implementing A Queue: The Naive Approach
The naive approach to implementing a queue is to use a table with a clustered index, push the data on using a conventional INSERT and pop it off using a DELETE with an OUTPUT clause. Unfortunately this creates the perfect “last page problem” storm, in which all the insert threads contend on the rightmost leaf node in the b-tree and throughput is throttled by page latching:
This approach might scale well with the in-memory OLTP engine, which is latch free and uses hashing to distribute rows across buckets.
Introducing The “LMax Disruptor Pattern”
One solution to this, proposed by Thomas, is the use of the LMax disruptor pattern, the crux of which is the implementation of a ring buffer:
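Before diving into the T-SQL, the mechanics of the pattern can be modelled in a few lines of Python. This is purely a conceptual sketch of the slot / sequence / reference-count interplay used throughout this post, not part of the SQL implementation; the class and method names are my own invention:

```python
class RingBufferQueue:
    """Toy single-threaded model of the disruptor-style queue built below:
    a fixed array of slots, monotonically increasing push and pop sequences,
    and a reference count per slot to stop producers overwriting messages
    that have not been consumed yet."""

    def __init__(self, size):
        self.size = size
        # Each slot mirrors a row in the SQL queue table.
        self.slots = [{"message": None, "reference_count": 0} for _ in range(size)]
        self.push_seq = 0  # stands in for NEXT VALUE FOR dbo.PushSequence
        self.pop_seq = 0   # stands in for NEXT VALUE FOR dbo.PopSequence

    def push(self, message):
        slot = self.slots[self.push_seq % self.size]
        if slot["reference_count"] != 0:
            return False  # queue full: don't overwrite!
        slot["message"] = message
        slot["reference_count"] += 1
        self.push_seq += 1
        return True

    def pop(self):
        slot = self.slots[self.pop_seq % self.size]
        if slot["reference_count"] == 0:
            return None  # nothing to read yet
        message = slot["message"]
        slot["reference_count"] -= 1
        self.pop_seq += 1
        return message
```

The SQL version that follows implements exactly this shape, with the sequence objects playing the role of `push_seq` and `pop_seq` and the `reference_count` column guarding each slot.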
Firstly we need to create the queue table:
```sql
CREATE TABLE dbo.MyQLmax
(
     [Slot]            [bigint]    NOT NULL
    ,[message_id]      [bigint]    NULL
    ,[time]            [datetime]  NOT NULL
    ,[message]         [char](300) NOT NULL
    ,[reference_count] [tinyint]   NOT NULL
)
```
This will be seeded with 2,000,000 messages:

```sql
SET NOCOUNT ON

DECLARE @QueueSize int = 2000000
       ,@i         int = 1;

WHILE @i <= @QueueSize
BEGIN
    INSERT MyQLmax
    (
         Slot
        ,[time]
        ,[message]
        ,reference_count
    )
    VALUES
    (
         @i
        ,'2050-01-01'
        ,'dummy'
        ,0
    )
    SET @i = @i + 1
END;
```
To complete the queue, a clustered index will be created on the queue table:
```sql
CREATE UNIQUE CLUSTERED INDEX CIX
ON dbo.MyQLmax (Slot)
WITH (FILLFACTOR = 100)
```
A stored procedure and sequence object are required to push messages onto the queue:

```sql
CREATE SEQUENCE dbo.PushSequence AS BIGINT
    START WITH 1
    INCREMENT BY 1
    MINVALUE 1
    MAXVALUE 2000000
    CYCLE
    CACHE 1000;
GO

CREATE TABLE [dbo].[PushStats]
(
     [SPID]               [int] NULL
    ,[PushedMessageCount] [int] NULL
)
GO

CREATE PROCEDURE [dbo].[LmaxPush]
AS
BEGIN
    DECLARE @PushedMessageCount int    = 0
           ,@QueueSize          int    = 2000000
           ,@Slot               bigint
           ,@i                  int    = 1;

    SET NOCOUNT ON;

    INSERT INTO PushStats ([SPID], [PushedMessageCount])
    VALUES (@@SPID, 0);

    WHILE @i <= @QueueSize
    BEGIN
        SET @Slot = NEXT VALUE FOR dbo.PushSequence

        UPDATE dbo.MyQLmax WITH (ROWLOCK)
        SET  [time]          = GETDATE()
            ,[message]       = 'Hello World'
            ,[message_id]    = @Slot
            ,reference_count = reference_count + 1
        WHERE Slot = @Slot
        AND   reference_count = 0 /* Don't overwrite! */;

        IF @@ROWCOUNT = 1
        BEGIN
            SET @PushedMessageCount += 1;
        END
        ELSE
        BEGIN
            SET @i += @QueueSize;
        END;

        SET @i += 1;
    END;

    UPDATE PushStats
    SET    [PushedMessageCount] = @PushedMessageCount
    WHERE  [SPID] = @@SPID;
END;
```
The final pieces in the jigsaw are a sequence and stored procedure for popping messages off the queue:
```sql
CREATE SEQUENCE dbo.PopSequence AS BIGINT
    START WITH 1
    INCREMENT BY 1
    MINVALUE 1
    MAXVALUE 2000000
    CYCLE
    CACHE 1000;
GO

CREATE TABLE [dbo].[PopStats]
(
     [SPID]           [int] NULL
    ,[MessagesPopped] [int] NULL
) ON [FG_01]
GO

CREATE PROCEDURE [dbo].[LmaxPop]
AS
BEGIN
    DECLARE @MessagesPopped int       = 0
           ,@OutputMessage  char(300)
           ,@QueueReady     int       = 0
           ,@QueueSize      int       = 2000000
           ,@Slot           bigint    = 0
           ,@i              int       = 1

    SET NOCOUNT ON;

    /*
     * Allow the push procedure to build up a head of steam
     * in front of the pop procedure.
     */
    WHILE @QueueReady <= 20000
    BEGIN
        SELECT @QueueReady = SUM([reference_count])
        FROM   dbo.MyQLmax WITH (NOLOCK);

        IF @QueueReady <= 20000
        BEGIN
            WAITFOR DELAY '00:00:00.001';
        END;
    END;

    INSERT INTO PopStats ([SPID], [MessagesPopped])
    VALUES (@@SPID, 0);

    WHILE @i <= @QueueSize
    BEGIN
        SET @Slot = NEXT VALUE FOR dbo.PopSequence

        /*
         * Find slot
         */
        UPDATE dbo.MyQLmax WITH (ROWLOCK)
        SET  [time]            = GETDATE()
            ,@OutputMessage    = [message]
            ,[message_id]      = NULL
            ,[reference_count] = [reference_count] - 1
        WHERE [Slot] = @Slot
        AND   [reference_count] > 0;

        IF @@ROWCOUNT = 0
        BEGIN
            WAITFOR DELAY '00:00:00.001';
        END
        ELSE
        BEGIN
            SET @MessagesPopped += 1;
        END;

        SET @i += 1;
    END;

    UPDATE PopStats
    SET    MessagesPopped = @MessagesPopped
    WHERE  [SPID] = @@SPID;
END;
```
The code provided in this blog post is free for anyone to use as long as the following two points are observed:
- The exact performance you get is entirely dependent on the infrastructure you use for running SQL Server and on whatever else might be running on your instance at the time; the expectation of obtaining identical performance figures to those cited in this post is an unrealistic one.
- The initial ‘cut’ of the code in this blog post was written as a means of stress testing the LMax disruptor queue pattern and not as finished code fit for production. Should you require the latter, please read the addendum below.
For anyone wishing to use this code in anger, it is recommended that it be modified as follows. In the version of the pop code below, which uses a memory optimised table and an IDENTITY column, IDENT_CURRENT should be used to obtain the initial slot value, and the memory optimised table should only be inserted into if the ‘pop’ update results in a @@ROWCOUNT value of 1. Similarly, a version of the push code fit for production purposes should, in the event that the ‘queue’ is full, iterate with a delay until a free slot becomes available.
```sql
CREATE PROCEDURE [dbo].[LmaxPop]
AS
BEGIN
    DECLARE @MessagesPopped int       = 0
           ,@OutputMessage  char(300)
           ,@QueueReady     int       = 0
           ,@QueueSize      int       = 2000000
           ,@Slot           bigint    = 0
           ,@i              int       = 1

    SET NOCOUNT ON;

    INSERT INTO PopStats ([SPID], [MessagesPopped])
    VALUES (@@SPID, 0);

    WHILE 1 = 1
    BEGIN
        SELECT @Slot = current_value
        FROM   sys.sequences
        WHERE  name = 'PopSequence';

        /*
         * Find slot
         */
        UPDATE dbo.MyQLmax WITH (ROWLOCK)
        SET  [time]            = GETDATE()
            ,@OutputMessage    = [message]
            ,[message_id]      = NULL
            ,[reference_count] = [reference_count] - 1
        WHERE [Slot] = @Slot
        AND   [reference_count] > 0;

        IF @@ROWCOUNT = 0
        BEGIN
            WAITFOR DELAY '00:00:00.001';
        END
        ELSE
        BEGIN
            SET @Slot = NEXT VALUE FOR PopSequence;
            SET @MessagesPopped += 1;
        END;

        SET @i += 1;
    END;

    UPDATE PopStats
    SET    MessagesPopped = @MessagesPopped
    WHERE  [SPID] = @@SPID;
END;
```
The pop procedure has been coded with logic to allow the push procedure to create 20,000 messages in the queue before it starts removing them. Anyone wishing to use this code in anger for a practical application may want to reduce this number or remove the logic completely.
Time For A Test Drive!
Let's give this queue a test drive, using this test environment:
- 2 x 10 core Xeon E5 2660 2.2 GHz, hyper-threading enabled
- 64GB DDR4
- 4 x SanDisk Extreme Pro 480GB SSD
- Windows Server 2012 R2
- SQL Server 2016 CTP 2.4
- Trace flags 8008 and 8048 enabled
ostress -E -S2016CTP24 -dLmax -Q"exec LmaxPush" -n1
This first run takes 6 minutes and 59 seconds to complete, and this is what sys.dm_os_wait_stats has to say:
Overcoming The WRITELOG Barrier
For purely academic purposes, let's repeat this test with delayed durability. Were the queue to be used in anger in a real world application, all message pushes to the queue should be durable. Had my server had a flash PCIe card with an NVMe driver, I would have got closer to eliminating WRITELOG waits without having to go down the delayed durability path.
This change results in a dramatic drop in elapsed time; a single threaded execution drops to 1 minute and 5 seconds. sys.dm_os_wait_stats now reports:
But how well does this scale? The answer is: not very well 😦
The page latching that avoiding the naive approach was supposed to eliminate has come back to haunt us:
Re-Thinking The Message Slot Key
The PAGELATCH_EX wait event accrues time when more than one thread requires access to the same page in the buffer cache for data modification purposes:
The resolution is to prevent this situation from arising in the first place. Let's try changing the physical order of the queue pages on disk whilst preserving their logical order; this will require a new push procedure:
```sql
CREATE PROCEDURE [dbo].[LmaxPushV2]
AS
BEGIN
    DECLARE @PushedMessageCount int    = 0
           ,@QueueSize          int    = 2000000
           ,@1billion           bigint = 1000000000
           ,@2billion           bigint = 2000000000
           ,@3billion           bigint = 3000000000
           ,@4billion           bigint = 4000000000
           ,@5billion           bigint = 5000000000
           ,@6billion           bigint = 6000000000
           ,@7billion           bigint = 7000000000
           ,@8billion           bigint = 8000000000
           ,@9billion           bigint = 9000000000
           ,@Slot               bigint
           ,@i                  int    = 1;

    SET NOCOUNT ON;

    INSERT INTO PushStats ([SPID], [PushedMessageCount])
    VALUES (@@SPID, 0);

    WHILE @i <= @QueueSize
    BEGIN
        SET @Slot = NEXT VALUE FOR dbo.PushSequence

        SET @Slot = ( CASE RIGHT(@Slot, 1)
                          WHEN 0 THEN @Slot
                          WHEN 1 THEN @Slot + @5billion
                          WHEN 2 THEN @Slot + @1billion
                          WHEN 3 THEN @Slot + @8billion
                          WHEN 4 THEN @Slot + @2billion
                          WHEN 5 THEN @Slot + @7billion
                          WHEN 6 THEN @Slot + @3billion
                          WHEN 7 THEN @Slot + @6billion
                          WHEN 8 THEN @Slot + @4billion
                          WHEN 9 THEN @Slot + @9billion
                      END );

        UPDATE dbo.MyQLmax WITH (ROWLOCK)
        SET  [time]          = GETDATE()
            ,[message]       = 'Hello World'
            ,[message_id]    = @Slot
            ,reference_count = reference_count + 1
        WHERE Slot = @Slot
        AND   reference_count = 0 /* Don't overwrite! */

        IF @@ROWCOUNT = 1
        BEGIN
            SET @PushedMessageCount += 1;
        END
        ELSE
        BEGIN
            SET @i += @QueueSize;
        END;

        SET @i += 1;
    END;

    UPDATE PushStats
    SET    [PushedMessageCount] = @PushedMessageCount
    WHERE  [SPID] = @@SPID;
END;
```
In addition to this, a new INSERT statement is also required to seed the slots in the queue table:

```sql
TRUNCATE TABLE MyQLmax
GO
SET NOCOUNT ON

DECLARE @QueueSize int = 2000000
       ,@Slot      bigint
       ,@i         int = 1;

WHILE @i <= @QueueSize
BEGIN
    INSERT MyQLmax
    (
         Slot
        ,[time]
        ,[message]
        ,reference_count
    )
    SELECT ( CASE RIGHT(@i, 1)
                 WHEN 0 THEN @i
                 WHEN 1 THEN @i + 5000000000
                 WHEN 2 THEN @i + 1000000000
                 WHEN 3 THEN @i + 8000000000
                 WHEN 4 THEN @i + 2000000000
                 WHEN 5 THEN @i + 7000000000
                 WHEN 6 THEN @i + 3000000000
                 WHEN 7 THEN @i + 6000000000
                 WHEN 8 THEN @i + 4000000000
                 WHEN 9 THEN @i + 9000000000
             END )
          ,'2050-01-01'
          ,'dummy'
          ,0
    SET @i = @i + 1
END;
```
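The slot-scattering CASE expression is easier to reason about outside T-SQL. This Python sketch (names mine, not part of the SQL code) mirrors the mapping: the last decimal digit of a sequence value selects one of ten billion-sized "bands", so values that are adjacent in sequence order, and which would otherwise contend on the same b-tree leaf page, land roughly a billion keys apart:

```python
# Band chosen by the last decimal digit, mirroring the CASE expression
# in LmaxPushV2: digit 0 stays put, digit 1 jumps 5 billion, and so on.
BAND = {0: 0, 1: 5, 2: 1, 3: 8, 4: 2, 5: 7, 6: 3, 7: 6, 8: 4, 9: 9}

def scatter_slot(sequence_value):
    """Deterministically map a monotonically increasing sequence value to a
    scattered slot key, preserving uniqueness (each band holds a disjoint
    key range) while destroying physical adjacency."""
    return sequence_value + BAND[sequence_value % 10] * 1_000_000_000
```

The mapping is deterministic, which is the property that rules out GUIDs: a given push-sequence value must always resolve to the same slot so that the pop side can find it again.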
Eliminating PAGELATCH_EX waits results in much improved scalability:
Takeaway #1: a naive approach to implementing the LMax disruptor pattern, with slot numbers based on monotonically increasing values, does not scale well.
SQL Server and Reverse Key Indexes
The method used to preserve the logical slot order whilst disrupting the physical order is a crude one, and it needs to be deterministic, i.e. each value generated by the push sequence always has to correspond to the same pseudo-random value, thus ruling out the use of GUIDs. A more elegant approach is a "reverse key" index: a conventional b-tree index which stores the values of the indexed column(s) after they have undergone bit reversal. Oracle furnishes reverse key indexes out of the box; with SQL Server we have to craft reverse key indexes ourselves. I would refer anyone wishing to create reverse key indexes with SQL Server to this blog post, which details how to do it.
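For illustration, this is what the bit reversal underpinning a reverse key index looks like; a short Python sketch (my own, not taken from the referenced post), which shows the two properties that matter here, determinism and self-inversion:

```python
def reverse_key(value, bits=64):
    """Reverse the bit order of an integer key. Consecutive keys, which
    would otherwise pile onto the same b-tree leaf page, end up in
    completely different parts of the key space. The mapping is
    deterministic and is its own inverse."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)  # take lowest bit of value...
        value >>= 1                           # ...and append it to result
    return result
```

Because reversing twice returns the original key, the pop side can always recover the logical slot number from the stored reverse key.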
Where Are Our CPU Cycles Being Burned ?
Based on what we know, we get diminishing returns as the number of threads pushing messages onto the queue increases beyond 8:
Therefore, let's take a look at where the CPU time is going for 8 threads, at module level to begin with:
sqlmin.dll encapsulates the database engine runtime and sqllang.dll is used for T-SQL language processing and query optimization. Given that the push stored procedure is quite simple, sqllang.dll is burning up quite a lot of CPU time (weight = sampled CPU time in milliseconds). This should provide a hint as to what in-memory OLTP natively compiled procedures might provide in terms of a performance boost 😉
Let's drill down into sqlmin.dll:
Nothing in particular stands out as an excessive consumer of CPU time. SQLServerLogMgr::AppendLogRequest is the function used to copy logging information into the log buffer. The three function calls I have highlighted relate to locking and latching, two things that use of the in-memory OLTP engine should eliminate.
Where Has The Bottleneck Moved To ?
The graph begs the question: why does throughput drop at 11 threads? Wait statistics provide some clues:
. . . and sys.dm_os_latch_stats will reveal the exact latch:
Taken at face value this suggests that the latch is used to synchronize access to the sequence; Windows Performance Toolkit wait analysis can provide a definitive answer as to what the METADATA_SEQUENCE_GENERATOR latch protects:
Immediately after the call to:
there is a call to:
I was expecting this to be a call to sqlmin.dll!LatchBase::Acquire. Information on sqlmin.dll!LatchBase::Suspend in the public domain is non-existent bar this blog post, which implies that Suspend is used to release orphaned latches.
Performance and Scalability In Two Pictures
If I were to summarize all my material, be it slide decks or blog posts, in two images, it would be these two. Firstly, the CPU cycle latencies of accessing different parts of the memory hierarchy:
For any piece of software running on a modern processor to run fast, trips off the CPU to main memory need to be minimized as much as possible. Performance (the ability of something to "run fast") and scalability are related, but they are not the same.
The second image represents scale and anti-scale patterns; essentially, whenever a shared resource is contended on, be it a lock, latch or spinlock, a situation exists which is not conducive to scalability:
The same principle applies to execution plans when the gather and distribute streams exchange iterators are present in the plan.
Can We Do Better ?
At this point we have two options for increasing the scalability of our code, use:
- Processors with a faster clock speed and / or superior single threaded performance which allows the latch protecting the sequence to be acquired and released faster.
- Something other than a sequence object to obtain monotonically increasing values from.
The ideal solution would be the availability of an in-memory sequence object protected by compare and swap (CAS) instructions. All of this begs the question: if a sequence does not scale that well with the legacy database engine, has the SQL Server engine team made some special provision for the in-memory OLTP engine? At the time of writing this blog post I do not have an answer to this.
SQL Server, up to and including version 2016, does not support the retrieval of the NEXT value from a sequence in a natively compiled procedure.
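To make the idea of a CAS-protected sequence concrete, here is a sketch of the retry-loop shape such a sequence would have. This is purely illustrative: Python exposes no real CAS instruction, so a lock stands in for the atomic compare-and-swap; the class and method names are mine, and the point is the lock-free retry pattern rather than the synchronisation primitive itself:

```python
import threading

class CasSequence:
    """Sketch of an in-memory sequence advanced by a compare-and-swap loop
    instead of a latch, with CYCLE behaviour like the SQL sequences above."""

    def __init__(self, start=1, maxvalue=2_000_000):
        self._value = start - 1
        self._max = maxvalue
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def _compare_and_swap(self, expected, new):
        # Models a single atomic CAS instruction: install `new` only if the
        # current value is still `expected`; report whether we won the race.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def next_value(self):
        # Classic CAS retry loop: read, compute, attempt to install; if
        # another thread got there first, simply try again.
        while True:
            current = self._value
            new = current % self._max + 1  # wrap around, i.e. CYCLE
            if self._compare_and_swap(current, new):
                return new
```

Under contention a loser of the race retries rather than sleeping on a latch wait list, which is what makes the CAS approach attractive at high thread counts.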
One potential solution is to create a single column memory optimized table with an IDENTITY like this:
```sql
CREATE TABLE [dbo].[NonBlockingSequence]
(
     [ID] [bigint] IDENTITY(1,1) NOT NULL
    ,PRIMARY KEY NONCLUSTERED HASH ([ID]) WITH (BUCKET_COUNT = 524288)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
```
A sequence number (identity) is obtained using this code excerpt:

```sql
BEGIN TRANSACTION
    INSERT INTO dbo.NonBlockingSequence DEFAULT VALUES;
    SELECT @Slot = SCOPE_IDENTITY() % @QueueSize;
ROLLBACK TRANSACTION;
```
Note that this trick only works with an in-memory table due to its latch-free nature.
How does this fly ?
Scalability is improved from 8 to 14 threads and the highest throughput obtained is 192,715 messages pushed / second with 20 threads. For completeness, below is the data table from which the graph was constructed:
Takeaway #2: under load a sequence object has limited scalability, which can be overcome using a memory optimized table with an identity column. However, even though this approach eliminates latching, access to the IDENTITY value is protected internally by the DBTABLE spinlock.
This is the wait activity for the test with 22 threads:
Microsoft does not give much away as to what the WAIT_XTP_OFFLINE_CKPT_LOG_IO wait event represents, other than to state that it relates to reads on the log file required in order to perform offline checkpoints. Because a row shows up in sysprocesses for a thread waiting on this wait type, accumulating wait time even whilst the in-memory table is not in use, WAIT_XTP_OFFLINE_CKPT_LOG_IO can probably be safely ignored.
Can The Throughput Curve Be Pushed Still Further ?
I’m in the process of writing a continuation of this post involving the in-memory OLTP engine, which allows the LMax disruptor pattern to scale to over 550,000 message pushes per second:
In this post I’ve dug into using the LMax disruptor pattern with the legacy database engine and addressed several issues which inhibit its scalability:
- log write performance
- page latch contention
- sequence generation scalability
Of these points, the last one is worthy of its own blog post. I suspect that there is a point at which the in memory table technique hits a scalability brick wall due to the sheer volume of DBTABLE spinlock spins. I also hope to cover the topic of queuing when using the in memory OLTP engine.