One of the many things I hope to get around to blogging about, time permitting, are the challenges of building “Web scale” platforms using SQL Server. Of the many challenges this presents, one is coming up with a platform which can service OLTP and OLAP workloads equally well when the platform cannot tolerate the latency of traditional style ETL processes. SQL Server 2016 introduces operational analytics a potential solution to this, however this has two key limitations:
- The in-memory OTLP engine does not have feature parity with the legacy engine (yet).
- The batch engine will never out perform queries using the legacy engine against pre-aggregated data.
A different tact is to use some form of queue to copy the data from a normalized to a de-normalized data model. Great care has to be taken when creating this queueing mechanism in order to avoid it becoming a bottleneck in itself.
Thomas Kejser has blogged ( here ) on the use of the LMax disruptor pattern as a means to implement a high performance queue on a relational database engine. In this post I want to explore this in depth, this will include:
- How well a LMax disruptor based queuing mechanism scales on SQL Server.
- What the bottle necks are for this approach.
- Where a SQL Server LMax disruptor queue burns its CPU time.
Implementing A Queue: The Naive Approach
The naive approach to implementing a queue is to use a table with a clustered index, push the data using a conventional INSERT and pop it off using a DELETE with an OUTPUT clause. Unfortunately this creates the perfect “Last page problem” storm in which all the insert threads contend on the right most leaf node in the b-tree and throughput is throttled by page latching:
This approach might scale well with in-memory OLTP engine which is latch free and uses hashing to distribute rows across buckets.
Introducing The “LMax Disruptor Pattern”
One solution to this proposed by Thomas is the use of the LMAX disruptor pattern, the crux of which is the implementation of a ring buffer:
Firstly we need to create the queue table:
CREATE TABLE dbo.MyQLmax ( [Slot] [bigint] NOT NULL ,[message_id] [bigint] NULL ,[time] [datetime] NOT NULL ,[message] [char](300) NOT NULL ,[reference_count] [tinyint] NOT NULL )
This will be seeded with 2000,000 messages:
SET NOCOUNT ON DECLARE @QueueSize int = 2000000 ,@i int = 1; WHILE @i <= @QueueSize BEGIN INSERT MyQLmax ( Slot ,time ,message ,reference_count) VALUES ( @i ,'2050-01-01' ,'dummy' ,0) SET @i = @i + 1 END;
To complete the queue, a clustered index will be created on the queue table:
CREATE UNIQUE CLUSTERED INDEX CIX ON dbo.MyQLmax (Slot) WITH (FILLFACTOR = 100)
A stored procedure and sequence object are required to pop messages onto the queue:
CREATE SEQUENCE dbo.PushSequence AS BIGINT START WITH 1 INCREMENT BY 1 MINVALUE 1 MAXVALUE 2000000 CYCLE CACHE 1000; GO CREATE TABLE [dbo].[PushStats] ( [SPID] [int] NULL ,[PushedMessageCount] [int] NULL ) GO CREATE PROCEDURE [dbo].[LmaxPush] AS BEGIN DECLARE @PushedMessageCount int = 0 ,@QueueSize int = 2000000 ,@Slot bigint ,@i int = 1; SET NOCOUNT ON; INSERT INTO PushStats ( [SPID] ,[PushedMessageCount] ) VALUES ( @@SPID ,0); WHILE @i <= @QueueSize BEGIN SET @Slot = NEXT VALUE FOR dbo.PushSequence UPDATE dbo.MyQLmax WITH (ROWLOCK) SET [time] = GETDATE() ,[message] = 'Hello World' ,[message_id] = @Slot ,reference_count = reference_count + 1 WHERE Slot = @Slot AND reference_count = 0 /* Don’t overwrite! */; IF @@ROWCOUNT = 1 BEGIN SET @PushedMessageCount += 1; END ELSE BEGIN SET @i += @QueueSize; END; SET @i += 1; END; UPDATE PushStats SET [PushedMessageCount] = @PushedMessageCount WHERE [SPID] = @@SPID; END;
The final pieces in the jigsaw are a sequence and stored procedure for popping messages off the queue:
CREATE SEQUENCE dbo.PopSequence AS BIGINT START WITH 1 INCREMENT BY 1 MINVALUE 1 MAXVALUE 2000000 CYCLE CACHE 1000; GO CREATE TABLE [dbo].[PopStats]( [SPID] [int] NULL ,[PushedMessageCount] [int] NULL ) ON [FG_01] GO CREATE PROCEDURE [dbo].[LmaxPop] AS BEGIN DECLARE @MessagesPopped int = 0 ,@OutputMessage char(300) ,@QueueReady int = 0 ,@QueueSize int = 2000000 ,@Slot bigint = 0 ,@i int = 1 SET NOCOUNT ON; /* * Allow the push procedure to build up a head of steam * in front of the pop procedure. */ WHILE <= 20000 BEGIN SELECT @QueueReady = SUM([reference_count]) FROM dbo.MyQLmax WITH (NOLOCK); IF @QueueReady <= 20000 BEGIN WAITFOR DELAY '00:00:00.001'; END; END; INSERT INTO PopStats ( [SPID] ,[MessagesPopped] ) VALUES ( @@SPID ,0); WHILE @i &amp;lt;= @QueueSize BEGIN SET @Slot = NEXT VALUE FOR dbo.PopSequence /* * Find slot */ UPDATE dbo.MyQLmax WITH (ROWLOCK) SET [time] = GETDATE() ,@OutputMessage = [message] ,[message_id] = NULL ,[reference_count] = [reference_count] - 1 WHERE [Slot] = @Slot AND [reference_count] > 0; IF @@ROWCOUNT = 0 BEGIN WAITFOR DELAY '00:00:00.001'; END; ELSE BEGIN SET @MessagesPopped += 1; END; SET @i += 1; END; UPDATE PopStats SET MessagesPopped = @MessagesPopped WHERE [SPID] = @@SPID; END;
Disclaimer
The code provided in this blog post is free for anyone to use as long as the following two points are observed:
- The exact performance you will get is entirely dependent on the infrastructure you use for running SQL Server and on whatever else might be running on your instance at the time, the expectation for obtaining identical performance figures as those cited in this post is an unrealistic one.
- The initial ‘Cut’ of the code in this blog was written with the intention to provide a means of stress testing the LMax disruptor queue pattern and not as finished code fit for production code, however should you require this, please read the addendum below:
Addendum 12/01/2015
For anyone wishing to use this code in anger, it is is recommended that it be modified thus, my initial version was written purely as a means to stress test the LMax disruptor pattern and not with production purposes in mind, similarly in the code which follows which uses a memory optimised table and IDENTITY column, IDENT_CURRENT should be used to obtain the initial slot value, the memory optimised table should only then be inserted into if the ‘Pop’ update results in a @@ROWCOUNT value of 1. Similarly, a version of the push code which is fit for production purposes should iterate with a delay in the event that the ‘Queue’ is full until a free slot becomes available.
CREATE PROCEDURE [dbo].[LmaxPop] AS BEGIN DECLARE @MessagesPopped int = 0 ,@OutputMessage char(300) ,@QueueReady int = 0 ,@QueueSize int = 2000000 ,@Slot bigint = 0 ,@i int = 1 SET NOCOUNT ON; INSERT INTO PopStats ( [SPID] ,[MessagesPopped] ) VALUES ( @@SPID ,0); WHILE 1=1 BEGIN SELECT @Slot = current_value FROM sys.sequences WHERE name = 'PopSequence'; /* * Find slot */ UPDATE dbo.MyQLmax WITH (ROWLOCK) SET [time] = GETDATE() ,@OutputMessage = [message] ,[message_id] = NULL ,[reference_count] = [reference_count] - 1 WHERE [Slot] = @Slot AND [reference_count] > 0; IF @@ROWCOUNT = 0 BEGIN WAITFOR DELAY '00:00:00.001'; END; ELSE BEGIN SET @Slot = NEXT VALUE FOR PopSequence; SET @MessagesPopped += 1; END; SET @i += 1; END; UPDATE PopStats SET MessagesPopped = @MessagesPopped WHERE [SPID] = @@SPID; END;
The pop procedure has been coded with logic to allow the push procedure to create 20,000 messages in the queue before it will start removing them. Anyone wishing to use this code in anger for a practical application may want to reduce this number or remove this logic completely.
Time For A Test Drive !
Lets give this queue a test drive, with this test environment:
- 2 x 10 core Xeon E5 2660 2.2 Ghz hyperthreading enabled
- 64GB DDR4
- 4 x SanDisk Extreme Pro 480GB SSD
- Windows server 2012 R2
- SQL Server 2016 CTP 2.4
- Trace flags 8008 and 8048 enabled
ostress -E -S2016CTP24 -dLmax -Q”exec LmaxPush” -n1
This first run takes 6 minutes and 59 seconds to complete and this is what sys.dm_os_wait_stats has to say:
Overcoming The WRITELOG Barrier
For purely academic purposes, lets repeat this test with delayed durability. Were the queue to be used in anger in a real world application all message pushes to the queue should be durable. Had my server had a flash PCIe card with a NVMe driver, I would have got closer to eliminating WRITELOG waits without having to go down the delayed durability path.
This change results in a dramatic drop in elapsed time; a single threaded execution drops to 1 minutes and 5 seconds. sys.dm_os_wait_stats now reports:
But how well does this scale ?, the answer is not very well 😦
page latching which is the whole idea behind avoiding the naive approach has come back to haunt us:
Re-Thinking The Message Slot Key
The PAGELATCH_EX wait event accrues time when more than one thread requires access to the same page in the buffer cache for data modification purposes:
The resolution to this is to prevent this situation from taking place to begin with. Lets try changing the physical order of the queue pages on disk whilst preserving their logical order, this will require a new push procedure:
CREATE PROCEDURE [dbo].[LmaxPushV2] AS BEGIN DECLARE @PushedMessageCount int = 0 ,@QueueSize int = 2000000 ,@1billion bigint = 1000000000 ,@2billion bigint = 2000000000 ,@3billion bigint = 3000000000 ,@4billion bigint = 4000000000 ,@5billion bigint = 5000000000 ,@6billion bigint = 6000000000 ,@7billion bigint = 7000000000 ,@8billion bigint = 8000000000 ,@9billion bigint = 9000000000 ,@Slot bigint ,@i int = 1; SET NOCOUNT ON; INSERT INTO PushStats ( [SPID] ,[PushedMessageCount]) VALUES ( @@SPID ,0 ); WHILE @i <= @QueueSize BEGIN SET @Slot = NEXT VALUE FOR dbo.PushSequence SET @Slot = ( CASE RIGHT(@Slot, 1) WHEN 0 THEN @Slot WHEN 1 THEN @Slot + @5billion WHEN 2 THEN @Slot + @1billion WHEN 3 THEN @Slot + @8billion WHEN 4 THEN @Slot + @2billion WHEN 5 THEN @Slot + @7billion WHEN 6 THEN @Slot + @3billion WHEN 7 THEN @Slot + @6billion WHEN 8 THEN @Slot + @4billion WHEN 9 THEN @Slot + @9billion END ); UPDATE dbo.MyQLmax WITH (ROWLOCK) SET [time] = GETDATE() ,[message] = 'Hello World' ,[message_id] = @Slot ,reference_count = reference_count + 1 WHERE Slot = @Slot AND reference_count = 0 /* Don’t overwrite! */ IF @@ROWCOUNT = 1 BEGIN SET @PushedMessageCount += 1; END ELSE SET @i += @QueueSize; END; SET @i += 1; END; UPDATE PushStats SET [PushedMessageCount] = @PushedMessageCount WHERE [SPID] = @@SPID; END;
In addition to this a new INSERT statement is also required to seed create the slots in the queue table:
TRUNCATE TABLE MyQLmax GO NOCOUNT ON DECLARE @QueueSize int = 2000000 ,@Slot bigint ,@i int = 1; WHILE @i <= @QueueSize BEGIN INSERT MyQLmax ( Slot ,[time] ,[message] ,reference_count) SELECT (CASE RIGHT(@i, 1) WHEN 0 THEN @i WHEN 1 THEN @i + 5000000000 WHEN 2 THEN @i + 1000000000 WHEN 3 THEN @i + 8000000000 WHEN 4 THEN @i + 2000000000 WHEN 5 THEN @i + 7000000000 WHEN 6 THEN @i + 3000000000 WHEN 7 THEN @i + 6000000000 WHEN 8 THEN @i + 4000000000 WHEN 9 THEN @i + 9000000000 END ) ,'2050-01-01' ,'dummy' ,0 SET @i = @i + 1 END;
Eliminating PAGELATCH_EX waits results in much improved scalability !!!:
Takeaway #1 a naive approach to implementing the lmax disruptor pattern with slot numbers based on monotonically increasing values does not scale well.
SQL Server and Reverse Key Indexes
The method used to preserve the logical slot order but disrupt the physical order is a crude one and it needs to be deterministic, i.e. each value generated by the push sequence always has to correspond to the same pseudo random value, thus ruling out the use of GUIDs. A more elegant approach is to use a “Reverse key” index, this is a conventional b-tree index which stores the values of the indexed column(s) after they have under gone bit reversion. Oracle furnishes reverse key indexes out of the box, with SQL Server we have to craft reverse key indexes ourselves. I would refer anyone wishing to create reverse key indexes with SQL Server do this blog post which details how to do it.
Where Are Our CPU Cycles Being Burned ?
Based on what we know, we get diminishing returns as the number of threads pushing messages onto the queue increases beyond 8:
Therefore. lets take a look at where the CPU time is going for 8 threads, at module level to begin with:
sqlmin.dll encapsulates the database engine run time and sqllang.dll is used for T-SQL language processing and query optimization. Given that the push stored procedure is quite simple, sqllang.dll is burning up quite a lot of CPU time ( weight = sampled CPU time in milliseconds ). This should provide a hint as to what in-memory OLTP natively compiled procedures might provide in terms of a performance boost 😉
Lets drill down into sqlmin.dll:
Nothing in particular stands out as an excessive consumer of CPU time. SQLServerLogMgr::AppendLogRequest is the function used to copy logging information into the log buffer. The three function calls I have highlighted relate to locking and latching, two things that use of the in memory OTLP engine should eliminate.
Where Has The Bottleneck Moved To ?
The graph begs the question; why does throughput drop at 11 threads ?, wait statistics provides some clues:
. . . and sys.dm_os_latch_stats will reveal the exact latch:
Taken on face value this suggests that the latch is used to synchronize access to the sequence, windows performance toolkit wait analysis can provide a definitive answer as to what the METADATA_SEQUENCE_GENERATOR latch protects:
Immediately after the call to:
sqlmin.dll!CMEDSeqGen::GenerateNewValue
there is a call to:
sqlmin.dll!LatchBase::Suspend
I was expecting this to be a call to sqlmin.dll!LatchBase::Acquire. Information on sqlmin.dll!LatchBase::Suspend in the public domain is non existent bar for this blog post, the post implies that suspend is used to release orphaned latches.
Performance and Scalability In Two Pictures
If I was to summarize all my material, be it slide decks or blog posts in two images it would be these two; firstly the CPU cycle latency’s of accessing different parts of the memory hierarchy:
For any piece of software running on a modern processor to run fast trips off the CPU to main memory need to be minimized as much as possible. Performance ( the ability for something to “Run fast” ) and scalability are related, but are not the same.
The second image represents scale and anti-scale patterns, essentially whenever a shared resource is contended on, be it a lock, latch or spinlock, a situation exists which is not conducive to scalability:
The same principle applies to execution plans when the gather and distribute streams exchange iterators are present in the plan.
Can We Do Better ?
At this point we have two options for increasing the scalability of our code, use:
- Processors with a faster clock speed and / or superior single threaded performance which allows the latch protecting the sequence to be acquired and released faster.
- Something other than a sequence object to obtain monotonically increasing values from.
The ideal solution would be the availability of an in-memory sequence object protected by compare and swap (CAS) instructions. All of this begs the question. if a sequence does not scale that well with the legacy database engine, has the SQL Server engine team made some special provision for the in memory OLTP engine ?, at the time of writing this blog post I do not have an answer for this.
Addendum 08/01/2016
SQL Server, up to and including version 2016 does support the retrieval of the NEXT value from a sequence in natively compiled procedure.
One potential solution is to create a single column memory optimized table with an IDENTITY like this:
CREATE TABLE [dbo].[NonBlockingSequence] ( [ID] [bigint] IDENTITY(1,1) NOT NULL ,PRIMARY KEY NONCLUSTERED HASH ( [ID] ) WITH ( BUCKET_COUNT = 524288) ) WITH ( MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA )
a sequence number ( identity ) is obtained from using this code excerpt:
BEGIN TRANSACTION INSERT INTO dbo.NonBlockingSequence DEFAULT VALUES; SELECT @Slot = SCOPE_IDENTITY() % @QueueSize; ROLLBACK TRANSACTION;
Note that this trick only works with an in-memory table due to its latch free nature.
How does this fly ?
Scalability is improved from 8 to 14 threads and the highest throughput obtained is 192,715 messages pushed / second with 20 threads. For completeness, below is the data table from which the graph was constructed:
Takeaway #2 under load a sequence object has limited scalability which can be overcome using a memory optimized table with an identity column. However, even though this approach eliminates latching, access to the IDENTITY value is protected internally via the DBTABLE spinlock.
This is the wait activity for the test with 22 threads:
Microsoft does not give much away as to what the WAIT_XTP_OFFLINE_CKPT_LOG_IO wait event represents other than to state that it relates to reads on the log file required in order to perform offline check points. Because a row shows up in sysprocesses for a thread waiting on this wait type that accumulates CPU time whilst the in memory table is not in use, WAIT_XTP_OFFLINE_CKPT_LOG_IO can probably be safely ignored.
Can The Throughput Curve Be Pushed Still Further ?
I’m in the process of writing a continuation of this post involving the in memory OLTP engine, this allows the LMax disruptor pattern to scale to over 550,000 messages pushes per second:
Wrap Up
In this post I’ve dug into using the lmax disruptor pattern with the legacy database engine and addressed several issues which inhibit its scalability:
- log write performance
- page latch contention
- sequence generation scalability
Of these points, the last one is worthy of its own blog post. I suspect that there is a point at which the in memory table technique hits a scalability brick wall due to the sheer volume of DBTABLE spinlock spins. I also hope to cover the topic of queuing when using the in memory OLTP engine.
> This approach might scale well with in-memory OLTP engine
Actually, it should have the same kind of scalability issues to a lesser extent. You can’t build a queue out of a hash table so you need a range index. And there should be contention on the last page of it.
Also, CAS on the same data does not scale perfectly, either. CAS is merely a functionally constrained hardware lock.
Great work in this blog post. Would be interesting to see the throughput curves on a 128 core box. Probably, the perfect scalability is no longer present.
If I can address each of your comments in turn:
1. You can’t build a queue on a hash table.
If you were building your own hash table and if your choice of hashing algorithm was a poor one, this would cause problems. However, if we are talking about getting the database engine to do the leg work for us, via the in memory OLTP engine which uses hash buckets I cannot see this being a problem. Also if a good implementation of a reverse key index behaves as I think it should this would work also, although bit reversion is not technically the same as a hashing algorithm, at a high level you are effectively doing the same thing as hashing values.
2. Also CAS on the same data does not scale perfectly.
Agreed, however because a CAS instruction is a single machine instruction ( which is obvious ) it is much more efficient than a latch in terms of the CPU cycles it burns compared to a latch.
3. Would be interesting to see the throughput curves on a 128 core box.
For the legacy engine I suspect a combination of heavy spin contention on the XDESMGR and LOGCACHE_ACCESS spinlocks would kill the scalability curve long before you got to 128 threads. For the in memory OLTP engine and natively compiled stored procedures, the XDESMGR is no longer used, however you would either hit high LOGBUFFER waits or heavy spin contention on the LOGCACHE_ACCESS spinlock first.
Thanks for you comments and taking the time to read this blog post, I appreciate that my stuff is quite niche, but its still nice to have an audience.
Your stuff is awesome and is at the far edge of SQL Server tuning competence in this niche.
1. At the time of writing the comment I did not understand that you are not using range accesses to the queue. Equality predicates are enough. Hash indexes can therefore be used.
This should have no contention whatsoever on the hash buckets, you are correct in that. But each retrieval will result in at least one cache miss because the write happened on a different core. Memory transfers should now be a machine-wide limiting resource. Unclear, when that comes into play in practice.
2. True. Just want to emphasize my point: Concurrent writes, including CAS, on the same cache line do not scale at all with the number of threads doing them. I don’t think this contradicts anything of what you have said.
3. Hm who knows… Would be a cool test to run. Maybe transaction aborts for sequence number generation will be the highest contention cost?! That is my guess.
Fascinating article, thanks! Looking forward to the next installment. A couple minor questions/comments:
* I interpreted Thomas Kejser’s pseudo-code to mean that both push and pop would try again at the same slot in cases where the queue is full (push) or empty (pop). Was your change to increment the slot in such cases intentional? Does that change impact performance?
* In a real-world situation, it seems that pop would be constantly polling (with a small delay), but a push would occur only as new messages are generated. I think there may be some pretty bad worst-case behaviors that can come along with incrementing the pop slot even if there is no message to pop. For one, if the queue ever becomes empty, pop will quickly move in front of push. Once/if push catches up, the messages that were pushed in the meantime will be processed out of order because the first message to be processed will be the one where push has finally closed the gap on pop. Another edge case is that if the queue becomes empty and then only a single message is pushed, it will be ~30 minutes before the message is processed (2 million times the 1 ms wait) as the pop slot is in front of the pushed message and has to cycle all the way around to reach it. I could be misinterpreting, but in any case I suppose these are the sorts of issues that would need to be ironed out before using the code “in anger” (a new phrase for me as a non-Brit!).
* I think the “CREATE SEQUENCE dbo.PushSequence” should read “CREATE SEQUENCE dbo.PopSequence” in the pop section of the code
Thanks for reading the post, “My change” on reflection was most probably an oversight ( bug ) that crept in in my haste to get the LMax disrupt-or pattern to scale. The pop procedure should ideally get the current value of the sequence from sys.sequences, however as this is approach is proven to have limited scalability, the use of the memory optimised table is recommended in which case IDENT_CURRENT should be used to the current identity value.
Geoff, Please refer to the addendum in the blog post for code pop code which is more fit for production purposes subject to testing.
The code in the addendum makes sense, thanks! (Sometimes it’s hard to take off my code reviewer hat, sorry about that.)
Overall, I love the blog posts and it’s great to see the level of detail of your analysis. Definitely inspirational for those of us who are very comfortable with basic query plan tuning and in some cases wait stats analysis, but not yet debugging or tuning at such an advanced level as these blog posts.
Thanks for reading my blog, I’m delighted people find my musings worth reading and informative.
Question regarding the use of a pseudo-hash on the slot value: this will speed up pushes under high traffic, as you prove, but won’t it slow down pops, particularly if they’re done in less-frequent batches? I.e., having to read lots of pages with a single message each vs. a small series of pages full of messages?
There is a trade off here between logical IO and page latching. The implicit assumption behind implementing a queue in this manner is that you messages will be pushed and popped at a high frequency, therefore the design errs on the side of reducing waits on pagelatch_ex at the expense of increasing logical IO. The other consideration is that as you increase the size of the queue you risk it spilling out of the CPU cache and thrashing the lookaside buffer (TLB).