This is probably one of the most unheralded new features of SQL Server 2016, one which gets but a single bullet point in the CSS engineers’ “It Just Runs Faster” blog post series.
Querying sysprocesses and joining a few other objects to it reveals that the log writer is multi-threaded in SQL Server 2016:
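For anyone wanting to check this on their own instance, a query along these lines lists the log writer sessions (on 2016 it returns more than one row):

SELECT spid
      ,cmd
      ,status
FROM sys.sysprocesses
WHERE cmd = 'LOG WRITER';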
As always, I am never satisfied with reading about new features and accepting what they do at face value, you should know where this is going by now ! . . . So I decided to give this new feature a test drive with my singleton insert test. Here is the table I will insert into:
CREATE TABLE [dbo].[MyBigTable] (
     [c1] [bigint]    NOT NULL
    ,[c2] [datetime]  NULL
    ,[c3] [char](111) NULL
    ,[c4] [int]       NULL
    ,[c5] [int]       NULL
    ,[c6] [bigint]    NULL
    ,CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ( [c1] ASC )
     WITH ( PAD_INDEX              = OFF
           ,STATISTICS_NORECOMPUTE = OFF
           ,IGNORE_DUP_KEY         = OFF
           ,ALLOW_ROW_LOCKS        = ON
           ,ALLOW_PAGE_LOCKS       = ON ) ON [FG_01]
)
And this is the stored procedure that will be used to perform concurrent inserts into it via ostress (which comes with the RML utilities):
CREATE PROCEDURE [dbo].[usp_InsertSpid]
    @RowsToInsert int = 500000
AS
BEGIN
    SET XACT_ABORT ON;
    SET NOCOUNT ON;

    DECLARE @i      INTEGER = 0
           ,@base   BIGINT  = @@SPID * 10000000000
           ,@PartId INTEGER;

    WHILE @i <= @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i);
        COMMIT;
        SET @i += 1;
    END;
END;
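For reference, the procedure can be hammered concurrently with an ostress invocation along these lines (server name, database name and thread count are illustrative rather than the exact values used in the tests):

ostress -Slocalhost -E -dSingletonInsertTest -Q"EXEC dbo.usp_InsertSpid" -n10 -r1

Here -n controls the number of concurrent connections and -r the number of iterations per connection.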
This is my test hardware:
I should also mention that all the work in this post involves the legacy row store engine and not the in-memory OLTP engine.
I have two types of storage: a RAID 0 stripe composed of three SanDisk 480GB Extreme Pro drives, which according to DiskSpd gives the following performance for a workload that is 100% write intensive and composed of 60K writes:
and a single Samsung 512GB PCIe NVMe drive which gives us:
I have chosen the 60K write-intensive workload because I want to illustrate what the NVMe drive can do in comparison with a high-end consumer SSD for the best-case scenario when writing to a transaction log.
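For anyone wanting to run a similar baseline, a DiskSpd invocation along these lines exercises a 100% write workload with a 60K block size (duration, thread count, queue depth, file size and drive letter are illustrative, not the exact parameters used here):

diskspd -b60K -w100 -d60 -t8 -o32 -Sh -L -c20G R:\diskspd_test.dat

-b sets the block size, -w100 makes every IO a write, -Sh disables software and hardware caching and -L captures latency statistics.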
Test Time !
The first test is to compare SQL Server 2014 SP1 ( CU 4 ) against SQL Server 2016 RC1 with full durability:
And now for the exact same test but with delayed durability:
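For completeness, delayed durability is a database-level setting; the delayed durability tests force it for every transaction with something along these lines (database name is illustrative):

ALTER DATABASE [SingletonInsertTest] SET DELAYED_DURABILITY = FORCED;

With FORCED, every commit returns before its log records are hardened; the log buffer is flushed when it fills up or when sys.sp_flush_log is called.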
To understand why we get this performance degradation with SQL Server 2016 RC1, three key parts of a transaction’s life cycle need to be understood, along with the serialisation mechanisms that protect them:
1. Transaction Creation and Removal
This call stack excerpt is taken from SQL Server 2014 SP1 CU4; the call immediately after sqlmin!XdesMgr::GetXdesId is the acquisition of the XDESMGR spinlock (the name you will see in sys.dm_os_spinlock_stats) which protects this function. In a nutshell, sqlmin!XdesMgr::GetXdesId deals out transaction ids. The in-memory OLTP engine has a completely different mechanism for dealing out transaction ids, which I believe is based on the use of an atomic increment instruction and does not involve any spinlocks.
2. Copying Log Records Into The Log Buffer
To recap on something I have used a lot in my blog posts, this is the logging infrastructure ( which I need to get around to adding the log pool to at some stage ):
The big takeaway point from this diagram is that a hell of a lot happens between the point at which log records are generated and the point at which they are hardened to the transaction log.
Here is the call stack excerpt for the LOGCACHE_ACCESS spinlock, which serialises access to the log buffer; again, this is taken from SQL Server 2014 SP1 CU4. Despite the claims of Microsoft’s marketing department that the in-memory OLTP engine is spinlock free, the LOGCACHE_ACCESS spinlock is still used, but only for the copying of undo information to the log buffer.
3. Hardening Log Records To Disk
When the log writer flushes log records to disk, the LOGFLUSHQ spinlock ( again, the name you will see in sys.dm_os_spinlock_stats ) serialises access to the log flush queue:
And The Point Of All Of This Is . . . ?
If we understand the spinlocks associated with transaction id generation and logging, we have a means of identifying pressure on these particular points of the engine. For the tests with delayed durability, the spin activity on the three key spinlocks is:
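Figures of this kind can be captured by taking a delta of the relevant counters in sys.dm_os_spinlock_stats before and after each test run, for example:

SELECT name
      ,collisions
      ,spins
      ,spins_per_collision
      ,backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN ('XDESMGR', 'LOGCACHE_ACCESS', 'LOGFLUSHQ');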
My theory is that a combination of:
- The number of osstress insert threads
- The performance of the NVMe flash drive
- An optimal log write size of 60K
- The fact there is now more than one log writer thread, meaning we would get LOGFLUSHQ spins naturally, even without the perfect storm that has been created here to put this spinlock under pressure.
is creating the “perfect storm” whereby the multiple log writers in SQL Server 2016 contend on draining the log flush queue, to the extent that spins on LOGFLUSHQ create a back-pressure effect against the log buffer. Hence the number of LOGCACHE_ACCESS spins is so much higher for SQL Server 2016 than it is for 2014, and, in summary, it is this spin activity that is sending performance backwards compared to SQL Server 2014.
Why Have Microsoft Done This ?
At this point in time I do not have an honest answer to this; however, something that has caught my eye is the function call highlighted below, which is present in SQL Server 2016 but not in 2014:
My hunch, and this is only a hunch, is that the multi-threaded log writer may have been introduced as part of some wider piece of work to do with availability group optimisation. I know for a fact that a lot of engine tuning has gone into SQL Server 2016, as per this graph provided by Jimmy May from SanDisk:
And Remember !!!
If you like my work, places are still available for my SQL Bits precon.
“Group commit” usually refers to writing the log records of multiple transactions in one IO. I don’t see a relationship to AGs based on that word.
When you say “usually refers”, which version of SQL Server are you referring to, and can you provide me with any official Microsoft documentation or stack traces to support this claim? This function call did not appear in the stack trace for SQL Server 2014. There are white papers on the in-memory OLTP engine which refer to the grouping of records together for larger writes, but they make no mention of the writes themselves being grouped together. I invite you to try and find the IsGroupCommitEnabled function in any version of SQL Server prior to 2016. Taken at face value, this function name hints at something that needs to be enabled; by a trace flag? I don’t know; also, its name seems similar to the Is . . . Enabled columns in sys.databases. Finally, as per the use of the word hunch twice, I tried to make the point that this was a tenuous link. However, it is entirely within the realms of possibility that Microsoft have implemented this to support the enhancement to the availability group wire protocol. Consider this: if you can commit on a replica at 1GB/s, according to the work SanDisk have done with SQL Server 2016, it does not take a great mental leap to come to the conclusion that Microsoft may have done work in other areas, i.e. the log writer, which (in their eyes at least) supports this. But again, to use the word for a third time, this is only a ‘hunch’ and could be something or nothing, which, to come full circle, is the point I was trying to make.
Bob Dorr just published an article “SQL 2016 – It Just Runs Faster: Updated Scheduling Algorithms” https://blogs.msdn.microsoft.com/psssql/2016/04/01/sql-2016-it-just-runs-faster-updated-scheduling-algorithms/
Could it be that the “multi-threaded” behaviour is a result of the threads balancing scheduling resources? Apparently this scheduling algorithm has already been in Azure since March 2014.
You can take CPU cores out of the affinity mask; this is where the log writer ** usually ** runs on versions of SQL Server prior to 2016. On 2016 it appears that the log writer ** usually ** uses cores 0 and 1 of CPU socket zero; if you have hyper-threading enabled, it will appear that SQL Server 2016 uses four logical CPUs (a query for checking where the log writer threads are running is sketched after the list below). I suspect the problem is down to the fact that when you have more than one log writer running with delayed durability ( implicit transactions will achieve the same effect ) you have LOGFLUSHQ spinlock contention between the two log writer threads. As per my graphs in the blog post, the effect of this is not noticeable when using normal full durability; change to delayed durability, and the log writers clear the log flush queue down so fast that they begin to stress the LOGFLUSHQ spinlock. Most people will not see this on SQL Server 2016; to reproduce it you need:
1. Low latency storage for the transaction log.
2. A scenario specifically constructed in order to achieve a high throughput of singleton inserts.
3. A reasonable number of CPU cores.
4. . . . and obviously SQL Server 2016
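As mentioned above, the CPU placement of the log writer threads can be checked by joining the log writer requests to their schedulers; a minimal sketch using the standard DMVs ('LOG WRITER' is the command name the threads report):

SELECT r.session_id
      ,r.command
      ,r.scheduler_id
      ,s.cpu_id
      ,s.parent_node_id
FROM sys.dm_exec_requests AS r
JOIN sys.dm_os_schedulers AS s ON s.scheduler_id = r.scheduler_id
WHERE r.command = 'LOG WRITER';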
I suspect that Microsoft have not hammered SQL Server 2016 in the way I have . . .
You will notice that with my test on SQL Server 2014 the number of XDESMGR spins is higher than that for 2016, which makes me conclude that this is not the bottleneck inhibiting insert throughput.
I hope this explains the scenario better.
I’ve looked at your response again. My thoughts, and this is 100% pure conjecture on my part, are that if you have more than one user database per instance, despite the fact that log records always have to be hardened in log sequence order, there should be no logical reason why the log flush queue cannot have more than one log writer draining it when flushing data to the transaction logs of multiple databases. Again, this is 100% pure conjecture; I suspect that ( as per my wrap-up comments ) the multi-threaded log writer might be part of a wider piece of work associated with the tuning of the availability group wire protocol in SQL 2016.
Chris,
I wouldn’t expect any great difference but have you tried this on the RC2 build yet?
RC2 is on the to-do list.
Now RC3 and this from CSS https://blogs.msdn.microsoft.com/psssql/2016/04/19/sql-2016-it-just-runs-faster-multiple-log-writer-workers/
I’ve tested this with RC3 with low latency storage, delayed durability and a test that is designed to stress the logging infrastructure; it’s slower than 2014.
Besides your Connect item I would imagine that CSS knows all about your findings. Wonder how this performs if you could find a way of going back to 1 log writer?
I would hope it would perform as well as SQL Server 2014; having said that, they have added more code around the log writer, so the code path might be longer. However, without getting the Windows Performance Toolkit out I can’t offer an opinion (as yet) as to what this extra code does.
Hi,
Yesterday I installed 2K16 RC3 on my laptop, where I have 4 cores (hyper-threading enabled), but I cannot see multiple threads for log writing. Could you please confirm which SQL Server version (RC0, RC1 or RC2?) you used?
Thanks!
According to the CSS engineers’ blog post on this, you get one log writer thread per NUMA node, implying you need a machine with at least two CPU sockets to see this feature; this is why you cannot see the multiple log writer threads on your laptop. All laptops, to the best of my knowledge, only support a single CPU socket.
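A quick way to check how many online NUMA nodes SQL Server sees, and hence how many log writer threads to expect under this scheme, is:

SELECT node_id
      ,node_state_desc
      ,online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc = 'ONLINE';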
Thank you for correcting me! I completely missed it.
Chris,
Have you had a chance to try this with the RTM version?
I’m on the case !
Eagerly awaiting your RTM tests. I’m discussing your results with the dev team, as this seems like a backward step in perf under some circumstances that should be configurable to turn off. Good stuff as always!
Hi Paul,
My findings with the RTM version are no different to what I found with the CTP versions of SQL 2016. As you have access to people at Microsoft that I do not, I would be interested to know what specific problem Microsoft were trying to solve with it.
Chris
Hello Datatake,
First of all, congratulations on your excellent posts; I was not aware of your blog and it will become a must-read for me!
Maybe the multi-threaded log writer has been developed together with the multi-threaded redo job? As far as I know, one of the new features of SQL 2016 is the improved performance of the redo job in an AlwaysOn scenario.
Chris,
According to https://blogs.msdn.microsoft.com/bobsql/2016/06/03/sql-2016-it-just-runs-faster-multiple-log-writer-workers/ this might have been fixed in CU1.
Hi Chris, thanks for the heads-up on this; Bob Ward mentioned this to me last week.
Chris,
Just wondering if you retested using CU1 yet?
Chris
Yes I have; the issue still stands, however you get a much smoother singleton insert throughput curve as you increase the number of threads.
BEGIN TRANSACTION
INSERT INTO dbo.MyBigTable (c1)
VALUES (@base + @i);
COMMIT;
Sorry for the slightly off-topic question, but are transactions really needed when you have just INSERTs (say with no triggers), or could one skip them altogether?
George, this is only really for readability / my personal preference; without the transaction block you still get a transaction, but an implicit one.
Indeed, assuming implicit transactions are on in db connection options
However, without explicit/implicit transactions at all, I was wondering whether INSERTs are still atomic, and it seems they indeed are (via internal “autocommit transactions”, as mentioned at https://www.c-sharpcorner.com/UploadFile/84c85b/understanding-transactions-in-sql-server/ ).
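For anyone wanting to see the three modes side by side, a minimal T-SQL illustration (reusing the table from the post):

-- Autocommit (the default): each statement is its own transaction
INSERT INTO dbo.MyBigTable (c1) VALUES (1);

-- Implicit transactions: the first INSERT opens a transaction
-- which stays open until it is explicitly committed
SET IMPLICIT_TRANSACTIONS ON;
INSERT INTO dbo.MyBigTable (c1) VALUES (2);
COMMIT;
SET IMPLICIT_TRANSACTIONS OFF;

-- Explicit transaction, as used in the stored procedure above
BEGIN TRANSACTION;
INSERT INTO dbo.MyBigTable (c1) VALUES (3);
COMMIT;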