One of my missions of late has been to obtain some time on a “Big box” in order to see how certain spinlocks behave under pressure; despite my best efforts I have not had much luck on this front. However, there is something I can do with the two-socket servers I already have. Let me outline the principle I wish to demonstrate. I am going to create a very simple heap, the simplest possible:
CREATE TABLE [dbo].[InsertTest] ( [COL1] [int] NULL )
and I am going to insert values into it using a very simple block of T-SQL:
SET NOCOUNT ON;
BEGIN
    DECLARE @i int = 0;
    WHILE @i < 1000000
    BEGIN
        BEGIN TRANSACTION;
        INSERT INTO InsertTest VALUES(@i);
        COMMIT;
        SET @i += 1;
    END;
END;
I should mention that delayed durability is not being used.
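For anyone repeating the test, the setting is easy to confirm via `sys.databases`; a minimal sketch (substitute your own test database name):

```sql
-- Confirm delayed durability is disabled for the current database
SELECT name, delayed_durability_desc
FROM   sys.databases
WHERE  name = DB_NAME();
```

The `delayed_durability_desc` column should report `DISABLED`.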
The log writer invariably lives on core 0 of the first CPU socket. With modern multi-socket hardware, the terms ‘socket’ and ‘NUMA node’ are interchangeable most of the time.
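This can be sanity-checked against the DMVs; a quick sketch (note that recent versions of SQL Server can run more than one log writer thread):

```sql
-- Locate the scheduler, and hence the CPU, the log writer is bound to
SELECT r.command, r.scheduler_id, s.cpu_id
FROM   sys.dm_exec_requests  AS r
JOIN   sys.dm_os_schedulers  AS s
       ON s.scheduler_id = r.scheduler_id
WHERE  r.command = 'LOG WRITER';
```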
Does It Matter In Terms Of Performance Where The Insert Runs?
Let’s test this. Using the CPU affinity mask I can alter the logical processors which are available to the database engine. Because I have hyper-threading enabled, the first two logical processors on NUMA node 0 correspond to the first physical core on the socket, logical processors 10 and 11 correspond to core 5 (note that all numbering starts from 0), etc.:
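As a sketch of how the affinity is set, `ALTER SERVER CONFIGURATION` can pin the engine to specific logical processors (the processor numbers below assume my layout of two logical processors per physical core):

```sql
-- Pin the engine to logical processors 0 and 1 (core 0 of socket 0)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 1;

-- Pin it to logical processors 10 and 11 (core 5 of socket 0)
-- ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 10 TO 11;

-- Revert to the default of using all logical processors
-- ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```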
What do sys.dm_os_wait_stats and sys.dm_os_spinlock_stats have to say?
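For reference, this is the sort of query I use; the stats are cleared between runs via `DBCC SQLPERF` (the spinlock variant of the `CLEAR` option is, as far as I know, undocumented but widely used):

```sql
-- Reset the counters before each test run
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR);

-- After the run: log write waits ...
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type = 'WRITELOG';

-- ... and the spinlocks of interest
SELECT name, collisions, spins, backoffs
FROM   sys.dm_os_spinlock_stats
WHERE  name IN ('LOGCACHE_ACCESS', 'LOGFLUSHQ');
```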
CPU Socket 0 Core 0:
CPU Socket 1 Core 0:
Clearly the insert runs fastest on core 0 of the first CPU socket, and the wait stats show that WRITELOG waits only become evident when the insert runs on socket 1. Conventional tools will only shed so much light on this, therefore it’s time to use the Windows Performance Toolkit:
CPU Socket 0 Core 0
CPU Socket 0 Core 5
CPU Socket 1 Core 0
A Cache Line and Cache Entry Primer
The unit of transfer of data from main memory to the CPU is the “cache line”, which is usually 64 bytes in size:
Once cache lines are loaded into the CPU cache they become cache entries.
Spinlocks, Log Buffers and The CPU Cache
Any piece of data used by a CPU has to appear as a cache entry in the CPU cache at some point during its life cycle. When a thread needs to write logging data into a log buffer, both the log buffer and the cache entry associated with the LOGCACHE_ACCESS spinlock need to travel from core 0 on CPU socket 0 to the local CPU cache of the core on which the thread generating the logging information is running, in order that the thread can use the log buffer. The thread writes to both the LOGCACHE_ACCESS spinlock and the log buffer before the buffer is returned to the CPU core that the log writer is running on. ( Spoiler alert ! 😉 ) The ‘distance’ a cache entry has to travel, i.e. whether:
- The log buffer and LOGCACHE_ACCESS cache entries are already in the L2 cache of the core running the thread generating the logging information, and therefore no travel is required at all.
- It has to travel from CPU core to CPU core via the L3 cache.
- It has to travel from CPU socket to CPU socket via the quick path interconnect.
has major performance ramifications.
By running the insert thread as closely as possible to the CPU core the log writer thread is running on we can increase the write throughput to the transaction log, by:
- 100% when the insert thread and log writer thread reside on the same CPU core
- 50~60% when the insert thread and the log writer thread are running on the same CPU socket
this is when compared to running the two threads on different CPU sockets.
This illustrates the overhead of having more CPU cores and sockets, but why exactly? Note that in the second test the number of spins on the LOGCACHE_ACCESS spinlock goes up from 24,192 to 1,710,336. This is due to the cache entry associated with the LOGCACHE_ACCESS spinlock having to travel across the level 3 cache from core to core, which takes longer than the cache entry being shared via the level 2 cache, hence the increased number of spins. When the insert thread is on a different socket, the overhead of the cache entry associated with the log buffer having to travel from CPU socket to socket via the quick path interconnect becomes noticeable, which is why spins on the LOGFLUSHQ spinlock increase.
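The spin deltas quoted above can be captured with a simple snapshot-and-diff sketch around the insert loop (the temp table name is arbitrary):

```sql
-- Snapshot the spinlock counters before the test
SELECT name, spins, backoffs
INTO   #before
FROM   sys.dm_os_spinlock_stats
WHERE  name IN ('LOGCACHE_ACCESS', 'LOGFLUSHQ');

-- ... run the insert workload here ...

-- Diff the counters after the test
SELECT a.name,
       a.spins    - b.spins    AS spins_delta,
       a.backoffs - b.backoffs AS backoffs_delta
FROM   sys.dm_os_spinlock_stats AS a
JOIN   #before                  AS b
       ON b.name = a.name;

DROP TABLE #before;
```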
We can see that when the insert thread is closer to the log writer, as per the graph above, it burns more CPU cycles because it is being made to work harder, as is evident from the IO throughput graphs from earlier on.
What This Exercise Ultimately Boils Down To
Processing without having to go off the CPU socket for data is the undisputed performance king!
But In A Previous Blog Post You Advised Taking CPU Socket 0 Core 0 Out Of The CPU Affinity Mask ?
Yes, I did. If you are experiencing a heavy amount of spin and back-off activity on the LOGCACHE_ACCESS spinlock, this advice still holds true. So far, with the hardware at my disposal, I have been unable to generate any pressure of note on this spinlock, which is why I will be investing in new hardware next year.
How Exactly Does This Help Me ?
This test illustrates the performance penalty incurred by the log writer infrastructure when we go off the CPU socket, the inference being that the best OLTP performance can be achieved on a single-socket machine with the fastest-clocked CPU cores available and a large CPU cache. The perfect storm would be for everything to run on the first CPU core of the first CPU socket; however, this would only be practical for the smallest of OLTP workloads.