In this post I want to cover transaction log write performance in detail: some of the nuances of storage area networks that can work against you, and how lowering log write latency can be more subtle than throwing faster disks at the problem.
The Basics: Virtual Log File Counts
Refer to these blog posts on how many virtual log files you should have and their impact on performance:
- Transaction Logs Too Many or Too Few
- Important Change To The VLF Creation Algorithm In Sql Server 2014
- Performance impact: a large number of virtual log files – Part I
- Performance impact: a large number of virtual log files – Part II
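As a quick check, the VLF count for a database can be listed with DBCC LOGINFO (one row per VLF; undocumented but widely used), or, on SQL Server 2016 SP2 and later, with sys.dm_db_log_info; a minimal sketch:

```sql
-- One row per VLF; the row count is the VLF count
DBCC LOGINFO;

-- SQL Server 2016 SP2 onward: set-based alternative
SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID());
```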
The Basics: Storage
Place your log files on the lowest-latency storage possible; if RAID is involved, this should preferably not be RAID 5.
The Basics: “Trimming The Fat”
There are two excellent SQLskills blog posts which cover ways of reducing the number of transaction log records generated:
Getting Log Records Into Log Cache To Begin With
Before a commit can be serviced (no delayed durability here), the logging information for the relevant transaction has to be present in the log buffer; access to this is controlled via the LOGCACHE_ACCESS spinlock. I’ve been unable to create pressure on this spinlock with the hardware at my disposal and OLTP workloads, but I have been able to do so with the parallel insert which came in with SQL Server 2014. Affinitizing the database engine away from core 0 on NUMA node 0, where the log writer usually runs, allows the LOGCACHE_ACCESS spinlock cache line to be freed up without the additional cost of log writer thread context switches (blogged here):
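Pressure on the log-related spinlocks can be sampled from sys.dm_os_spinlock_stats; a minimal sketch (the counters are cumulative since instance start, so in practice you would delta two snapshots across an interval):

```sql
-- Snapshot spin activity for the log-related spinlocks; a high and rising
-- spins_per_collision on LOGCACHE_ACCESS suggests contention on log buffer access
SELECT name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN ('LOGCACHE_ACCESS', 'LOGFLUSHQ')
ORDER BY spins DESC;
```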
Beyond The Basics: WRITELOG Waits Higher Than Average Log IO Stall!
Time For A Test Drive
- SQL Server 2014 SP1
- Windows Server 2012 R2
- Dell T5500 with 2 x 6-core 2.4 GHz Xeons (Westmere)
- 32GB 1600 MHz triple-channel ECC memory
- Trace flags 1117, 1118, 2330 and 8008 used
- 1 x 2.4TB ioDrive2 Duo
- 2 x SanDisk 480GB Extreme Pro solid state drives
I’m going to repeat the tests I performed in my super scaling singleton inserts blog post; however, rather than go through all the tuning steps required to reach 500,000 inserts per second, I’m going to jump straight to the test setup that achieves this.
This is the table that is inserted into:
CREATE TABLE [dbo].[MyBigTable] (
     [c1] [bigint] NOT NULL
    ,[c2] [datetime] NULL CONSTRAINT [DF_BigTable_c2] DEFAULT (GETDATE())
    ,[c3] [char](111) NULL CONSTRAINT [DF_BigTable_c3] DEFAULT ('a')
    ,[c4] [int] NULL CONSTRAINT [DF_BigTable_c4] DEFAULT ((1))
    ,[c5] [int] NULL CONSTRAINT [DF_BigTable_c5] DEFAULT ((2))
    ,[c6] [bigint] NULL CONSTRAINT [DF_BigTable_c6] DEFAULT ((42))
    ,CONSTRAINT [PK_BigTable] PRIMARY KEY CLUSTERED ([c1] ASC)
     WITH (PAD_INDEX = OFF
          ,STATISTICS_NORECOMPUTE = OFF
          ,IGNORE_DUP_KEY = OFF
          ,ALLOW_ROW_LOCKS = ON
          ,ALLOW_PAGE_LOCKS = OFF) ON [FG_DEST]
) ON [FG_DEST]
Before kicking off the concurrent inserts, the table is seeded using the statement below; a couple of additional statements ensure the buffer cache is cold and all the VLFs in the transaction log are free:
WITH t4 AS (
    SELECT n FROM (VALUES (0),(0),(0),(0)) t(n))
,t256 AS (
    SELECT 0 AS n
    FROM t4 AS a
    CROSS JOIN t4 AS b
    CROSS JOIN t4 AS c
    CROSS JOIN t4 AS d)
,t16M AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num
    FROM t256 AS a
    CROSS JOIN t256 AS b
    CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable WITH (TABLOCKX)
SELECT CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
      ,GETDATE()
      ,'a'
      ,1
      ,2
      ,42
FROM t16M;

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
BACKUP LOG [AdventureWorksDW2014] TO DISK = 'NUL:';
Finally, this is the stored procedure that performs the actual inserts:
CREATE PROCEDURE [dbo].[usp_Insert]
    @RowsToInsert int = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @i    INTEGER = 0
           ,@k    BIGINT  = 0
           ,@base BIGINT  = @@SPID * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1)
            VALUES (@base + @k)
                  ,(@base + @k + 1);
            SET @k += 2;
        COMMIT;
        SET @i += 2;
    END;
END;
Results From Conventional Tools
These are the wait statistics for this test:
sys.dm_io_virtual_file_stats reports that the average write stall for the transaction log (total write IO stall time / total writes) is 0.1 ms. The perfmon counters that ship with SQL Server tell us that we are averaging around 15,000 log flushes per second and that around 100MB/s of logging information is being flushed to ‘disk’ per second:
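The average write stall figure can be reproduced with a simple query against sys.dm_io_virtual_file_stats; a sketch:

```sql
-- Average write stall per transaction log file: total write IO stall time
-- divided by total writes, as described above
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_write_ms * 1.0
           / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON  mf.database_id = vfs.database_id
    AND mf.file_id     = vfs.file_id
WHERE mf.type_desc = 'LOG';
```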
Conventional tools cannot tell us what is happening at microsecond granularity, which is important considering that the ioDrive has sub-millisecond write latency.
What Can The Windows Performance Toolkit Tell Us?
Using a set of xperf flags I mentioned in a previous blog post:
xperf -on PROC_THREAD+LOADER+DISK_IO+DISK_IO_INIT
we can capture the:
- Call stack that initiates writes to any file
- Elapsed time in microseconds of the call stack that initiated the IO
- Elapsed time in microseconds of servicing the IO request on the device itself
In this case the file of interest is the transaction log:
The next generation of non-volatile memory promises access latency on a par with that of main memory; when this comes to pass, the time expended in the transaction log write call stack could dwarf that spent on the storage device.
The transaction log write call stack is considerable; in comparison, SQL OS schedulers use just two non-SQL Server DLLs (an apples-to-potatoes comparison, I know):
The most CPU intensive functions invoked in sqlmin.dll are:
SQLServerLogMgr::AppendLogRequest is the function used to copy logging information into the log buffer; XdesRMReadWrite::GenerateLogRec produces the logging information in the first place. Two function calls account for the bulk of the CPU time expended involving the spinlock:
sqlmin.dll!XdesMgr::IsMoreThanOneWriterActive checks whether more than one thread that has performed a commit is active for the log flush that has been issued (if my reading of the stack is correct):
sqlmin.dll!XDesMgr::GetXdesId serves out transaction IDs; for obvious reasons there has to be an associated mechanism which enforces serialization:
The LOGFLUSHQ Spinlock
SQL Server has a finite number of log writes it can queue prior to being flushed to disk: 32 for SQL Server 2008/2008 R2 and 112 for SQL Server 2012 onward. The LOGFLUSHQ spinlock is acquired in order to ‘place’ (copy by reference) log entries in the log flush queue. Pressure on this spinlock indicates that the log writer cannot drain the flush queue fast enough.
The SQLCAT post Diagnosing Transaction Log Performance Issues and The Limits Of The Log Manager provides this query for monitoring pending transaction log io:
SELECT vfs.database_id
      ,df.name
      ,df.physical_name
      ,vfs.file_id
      ,ior.io_pending
      ,ior.io_handle
      ,vfs.file_handle
FROM sys.dm_io_pending_io_requests ior
JOIN sys.dm_io_virtual_file_stats(DB_ID(), NULL) vfs
    ON vfs.file_handle = ior.io_handle
JOIN sys.database_files df
    ON df.file_id = vfs.file_id
WHERE df.name = '[Logical Log File Name]';
If the number of rows returned by this query is consistently equal to 32 for SQL Server 2008/2008 R2 or 112 for SQL Server 2012 onward (thanks to Paul Randal for correcting me on this, it is not 212), which is highly unlikely, you are hitting the limit of the log flush queue; this is the main reason why the transaction log’s average write stall can be less than the average WRITELOG wait.
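For the comparison above, the average WRITELOG wait can be obtained from sys.dm_os_wait_stats; a sketch (signal wait time is subtracted to isolate the resource portion of the wait):

```sql
-- Average WRITELOG resource wait in milliseconds; compare this with the
-- average write stall reported by sys.dm_io_virtual_file_stats
SELECT wait_type,
       waiting_tasks_count,
       (wait_time_ms - signal_wait_time_ms) * 1.0
           / NULLIF(waiting_tasks_count, 0) AS avg_resource_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';
```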
Of all the storage architectures, storage area networks (SANs) in particular present several challenges when it comes to achieving low-latency writes:
#1 Latency of the path from server CPU to physical disk
This is the typical data path between the CPU and the disks in a SAN; for direct-attached storage the SAN fabric is replaced by a direct connection:
Transmitting any signal down a wire carries an overhead; add to this the latency chain of going through the host bus adapter, storage fabric, SAN ports, storage processor and whatever links these to the disk shelves. You cannot defy the laws of physics: putting any signal down any cable incurs a latency penalty. With a flash PCIe card, the only part of the latency chain is the PCIe bus itself.
#2 Write Cache Behavior
On certain models of SAN, IO can be quiesced until the write part of the storage processor cache is flushed to disk once it becomes full.
#3 IO Re-ordering / Scheduling
Storage processors and/or host bus adapters may re-order IO requests in order to achieve higher IO throughput. The elevator sort algorithm that can be turned on with HP Smart Array controllers is a prime example of this; the difference it makes to latency and throughput is illustrated below (image courtesy of Thomas Kejser):
#4 IO Segregation Down To Disk Level
More often than not, SANs do not allow you to isolate IO to a specific set of disks; when spinning disks are provisioned from a consolidated pool, transaction log write performance can suffer when the same disks get swamped by a checkpoint. Joe Chang has blogged about this at length here.
#5 Automatic Tiering
Some SANs have the ability to automatically place hot files onto faster storage tiers; the problem with this is the algorithm that determines what is ‘hot’. For example, if a file receives a spike in IO and the intelligence built into the array requires a day’s worth of statistics to determine whether the file should be moved onto faster storage, the move comes too late to benefit the spike that took place; the statistics can also average the spike in demand away.
And Now For The Good News
As I write this post, many of the current limitations of SANs are becoming history. Storage is at its greatest inflection point since the world’s first commercial hard drive, the IBM RAMAC, was released. The cost per GB of flash storage is dropping to around $1 to $1.50; as such we are now in the era of the all-flash array, and all of the following will die out:
- short stroking
- elevator sort algorithms to improve throughput at the expense of latency
- compression to bridge the performance gap between CPUs and storage; compression will still be relevant to all-flash arrays, but only as a means of increasing their capacity, not as a means of improving performance
- storage tiering
Performance concerns will shift to the fabric used to talk to the storage and to low-latency, CPU-cycle-efficient protocols; because of this I can see InfiniBand and the Non-Volatile Memory Express protocol becoming very popular.
The New Storage Protocol on The Block: Non-Volatile Memory Express (NVMe)
When Fusion IO first launched, they were ahead of the market in two key respects. Firstly, they realized they could achieve lower latency and higher throughput if servers talked to NAND flash on PCIe cards. Secondly, they realized that the legacy storage protocols of the time (SAS and SATA) could not exploit the parallelism of the PCIe bus; in effect they pre-empted the Non-Volatile Memory Express protocol with their own bespoke virtual storage layer protocol. The image below (reproduced from this AnandTech article) illustrates the CPU-cycle efficiency of NVMe versus the conventional IO stack used by SAS and SATA devices:
The evolution of this is NVMe over fabrics.
My ioDrive does not use the NVMe protocol; it uses the “IO memory virtual storage layer”. We can see how efficient it is in terms of the sampled CPU time it consumes compared to the rest of the database engine:
The Future, SMB 3.0, Windows Scale Out File Server and NVMe
The Server Message Block (SMB) protocol, Windows Scale-Out File Server and Non-Volatile Memory Express (NVMe) are likely to be popular in the storage architectures of the near future. Three architectures which use these technologies are illustrated below:
- Storage presented to SQL Server via Scale Out File Server and local NVMe flash storage in the storage tier
- Storage presented to SQL Server via Scale Out File Server and pooled flash storage accessed via NVMe over fabrics
- End to end use of NVMe over Fabrics
When All Else Fails
When you have exhausted all available options and are hitting the limits of the log writer, there are only two places you can really go:
- Delayed durability with SQL Server 2014 onward
Using this involves the hard decision of choosing performance over durability.
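As a sketch, delayed durability can be forced database-wide or opted into per transaction (the database name here is the one from this post’s test setup):

```sql
-- Database-wide: every commit becomes delayed durable, returning before
-- the log records are hardened to disk
ALTER DATABASE [AdventureWorksDW2014] SET DELAYED_DURABILITY = FORCED;

-- Alternatively, allow it and opt in per transaction:
ALTER DATABASE [AdventureWorksDW2014] SET DELAYED_DURABILITY = ALLOWED;

BEGIN TRANSACTION;
    INSERT INTO dbo.MyBigTable (c1) VALUES (1);
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```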
- ‘Shard’ The Database
Split the database up across multiple instances with each instance bound to a CPU socket. Prior to Windows Server 2012 R2, Windows System Resource Manager was one means of achieving this; thereafter the only option is to use Hyper-V or VMware to create a virtual machine per CPU socket. There are two sticking points here: firstly, your data model must lend itself to being sharded easily; secondly, your workload needs to be such that it falls evenly across the shards.
Achieving optimal log write performance is more nuanced than simply throwing fast disks at the problem and covers:
- Whether you are bottlenecked getting log records into the log buffer in the first place.
- Whether the log writer’s flush queue is being saturated.
- The length of the latency chain from the server’s CPUs to the physical disks or flash modules.
- Anything in the storage infrastructure likely to inflate latency, such as the write cache filling up and IO being stalled until it is flushed, or some form of ‘intelligence’ (and I use the term loosely) that queues, aggregates and re-orders IO because it assumes high throughput at the expense of latency is the desired performance goal.
This post contains images borrowed from Thomas Kejser’s Master Tuning course slide deck with his permission.