This is probably one of the most unheralded new features of SQL Server 2016 which gets but a single bullet point in the CSS engineers “It Just Runs Faster” blog post series.
Querying sysprocesses and joining a few other objects to it reveals that the log writer is multi threaded in SQL Server 2016:
As always I am never satisfied with with reading about new features and accepting what they do on face value, you should know where this is going by now ! . . . So I decided to give this new feature a test drive with my singleton insert test. Here is the table I will insert into:
CREATE TABLE [dbo].[MyBigTable] ( [c1] [bigint] NOT NULL ,[c2] [datetime] NULL ,[c3] [char](111) NULL ,[c4] [int] NULL ,[c5] [int] NULL ,[c6] [bigint] NULL CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ( [c1] ASC ) WITH ( PAD_INDEX = OFF ,STATISTICS_NORECOMPUTE = OFF ,IGNORE_DUP_KEY = OFF ,ALLOW_ROW_LOCKS = ON ,ALLOW_PAGE_LOCKS = ON) ON [FG_01] )
And this is the stored procedure that will be used to perform concurrent inserts into it via osstress (which comes with rml utilities):
CREATE PROCEDURE [dbo].[usp_InsertSpid] @RowsToInsert int = 500000 AS BEGIN SET XACT_ABORT ON; SET NOCOUNT ON; DECLARE @i INTEGER = 0 ,@base BIGINT = @@SPID * 10000000000 ,@PartId INTEGER; WHILE @i <= @RowsToInsert BEGIN BEGIN TRANSACTION INSERT INTO dbo.MyBigTable (c1) VALUES (@base + @i); COMMIT; SET @i += 1; END; END;
This is my test hardware:
I should also mention that all the work in this post involves the legacy row store engine and not the in-memory OLTP engine.
I have two types of storage; a Raid 0 stripe composed of 3 SanDisk 480Gb Extreme Pro drives, which according to DiskSpd gives the following performance for a workload which is 100% write intensive and composed of 60K writes:
and a single Samsung 512GB PCIe NVMe drive which gives us:
I have chosen the 60K write intensive workload because I want to illustrate what the NVMe drive can do in comparison with a high end consumer SSD for the best case scenario when writing to a transaction log.
Test Time !
The first test is to compare SQL Server 2014 SP1 ( CU 4 ) against SQL Server 2016 RC1 with full durability:
And now for the exact same test but with delayed durability:
To understand why we get this performance degradation with SQL Server 2016 RC1 three key parts of a transactions life cycle need to be understood along with the serialisation mechanisms that protect them:
- Transaction Creation and Removal
This call stack excerpt is taken from SQL Server 2014 SP1 CU4, the call immediately after sqlmin!XdesMgr::GetXdesId is for the XDESMGR spinlock (what you will see in sys.dm_os_spinlock_stats) for this function. In a nutshell sqlmin!XdesMgr::GetXdesId deals out transaction ids. The in-memory OTLP engine has a completely different mechanism for dealing out transaction ids, which I believe is based on the use of an atomic increment instruction and does not involve any spinlocks.
2. Copying Log Records Into The LOG Buffer
To recap on something I have used a lot in my blog posts, this is the logging infrastructure ( which I need to get around to adding the log pool to at some stage ):
The big takeaway point from this diagram is that there is a hell of a lot of things which happen from the point of log record generation to the point at which log records are hardened to the transaction log.
Here is the call stack excerpt for the LOGCACHE_ACCESS spinlock which serialises access to the log buffer, again this is taken from SQL Server 2014 SP1 CU4. Despite the claims of Microsoft’s marketing department that the in-memory OLTP engine is spinlock free, the LOGCACHE_ACCESS spinlock is still used, but only for the copying of undo information to the log buffer.
3. Hardening Log Records To Disk
When the log writer flushes log records to disk the LOGFLUSHQ spinlock ( again the name you will see in in sys.dm_os_spinlock_stats ) serialises access to the log flush queue:
And The Point Of All Of This Is . . . ?
If we understand the spinlocks associated with transaction id generation and logging we have a means of identifying pressure on these particular points of the engine. For the tests with delayed durability the spin activity on the key three spinlocks is:
My theory is that a combination of:
- The number of osstress insert threads
- The performance of the NVMe flash drive
- An optimal log write size of 60K
- The fact there is now more than one log writer thread, meaning we would get LOGFLUSHQ spins naturally without the perfect storm that has been created for putting this under pressure.
is creating the “Perfect storm” whereby the multiple log writers in SQL Server 2016 are contending on draining the log flush queue such that spins on the LOGFLUSHQ become a factor to such an extent that a “Back pressure” effect is created against the log buffer. Hence why the number of LOGCACHE_ACCESS spins is so much higher for SQL Server 2016 than it is for 2014. And in summary it is this spin activity that is sending performance backwards compared to SQL Server 2014.
Why Have Microsoft Done This ?
At this point in time I do not have an honest answer to this, however something that has caught my eye, is the function call highlighted below, this is present for SQL Server 2016 but not for 2014:
My hunch, and this is only a hunch, is that the multi threaded log writer may have been introduced as part of some wider piece of work to do with availability group optimisation. I know for a fact that a lot of engine tuning has gone into SQL Server 2016 as per this graph provided by Jimmy May from SanDisk:
And Remember !!!
If you like my work, places are still available for my SQL Bits precon.