In this post I would like to dig deeper into the NUMA awareness of the SQL Server batch engine. Firstly, can we infer anything from the SQL OS method calls with ‘NUMA’ in their name, based on this stack trace excerpt from my single socket laptop, the answer is an emphatic ‘No’:
I therefore turned my attention to different methods for disproving / proving NUMA awareness in the batch engine.
How Data Is Delivered Into The Column Store Object Pool
Data is delivered to the column store object pool via the buffer pool, memcpy a C/C++ function copies data from a source location in memory to a destination location, the column store object pool in this case.
The column store object pool stores BLOB pages contiguously as per this except from the white paper “Enhancements To SQL Server Column Stores“:
Each object is stored contiguously on adjacent memory pages. This simplifies and speeds up scanning of a column because there are no “page breaks” to worry about.
This creates the best possible conditions for the CPU memory pre-fetcher to do its job and in doing so ensure that expensive trips are to main memory are minimized.
OLTP and NUMA, Locality, Locality, Locality !!!
Before I dive any deeper into my investigation as to whether or not the batch engine is NUMA aware, lets take a step back and look at the NUMA aware behavior we want to see for OLTP applications:
and this is want we don’t want to see:
SQL OS is a completely different beast when it comes to scheduling threads for parallel queries, the CSS Engineers blog post “NUMA Connection Affinity and Parallel Queries” is recommended reading on the subject, note this excerpt which comes from the blog post:
If the DOP for the query can be satisfied by a single node the ideal node (most workers, schedulers and memory available at the time the query is started) is selected. The controlling worker can be on a different node than the ideal. For example you could have NUMA port affinity established for NODE 1 but the parallel query executes on NODE 4 because 4 was the ideal node. This is not a problem because the majority of the work is done by the parallel workers which are located together on a node.
“Available memory” is referred to but there is nothing which mentions thread and NUMA node local memory affinity.
Knobs and Dials For Controlling and Leveraging NUMA
- Enabling interleaved memory in the server’s BIOS turns NUMA off.
- Some server BIOSes allow individual CPU sockets to be turned off.
- Trace flags 839 and 8015, these force SQL OS to treat the buffer pool as a single pool of memory and ignore hardware NUMA, this is covered in this CSS engineers blog post.
- NUMA port affinity, this is more applicable to the world of OLTP than OLAP.
This Is All Very Good, But Is My Server NUMA Aware ?
If the option to view CPU utilization per NUMA node is not grayed out, your machine has NUMA enabled, the screen shot below was taken from my laptop, hence the option to view CPU utilization per NUMA node is disabled:
With hyper-threading disabled the relationship between physical CPU cores and logical processors is 1:1 and this is what we see in task manager:
Enabling hyper-threading results in the operating system seeing two logical processors per physical CPU core, changing what we see in task manager to:
If we create a column store that can fit inside each CPU socket’s bank of memory and run a simple query in which we ‘Warm’ the column store object pool on one CPU socket:
and then run the same query against the second CPU socket (using the CPU affinity mask to make each socket available/unavailable):
This will reveal whether the column store object pool stores pages in NUMA node zero’s bank of memory or whether the column store’s pages are distributed across both CPU socket’s banks of memory. If we were investigating the buffer cache in isolation there are a couple of performance counters which would help:
Unfortunately there is nothing I can see which exposes remote and local node memory access statistics for the column store object pool, at least not in SQL Server 2014.
My test server is a Dell T5500 precision work station with:
- Windows server 2012 R2 installed
- SQL Server 2014 enterprise edition CU 7
- 2 sockets, each populated with a 2.4Ghz 6 core Xeon (Westmere micro architecture).
- 32GB 1600Mhz triple channel memory
- 1 x Fusion IO 2.4TB ioDrive duo.
- 2 x 480GB SanDisk Extreme solid state drives
Test Data Creation
First of all I am going to create a column store that can fit inside one bank of memory, note that each socket has 2 x 8GB DIMMs. The rationale for doing this is to work out whether the batch engine is going to attempt to fit this into a column store object pool on one bank of memory or distribute it across the two:
WITH generator AS ( SELECT TOP 250000 id = ROW_NUMBER() OVER (ORDER BY a) FROM (SELECT a = 1 FROM master.dbo.syscolumns) c1 CROSS JOIN master.dbo.syscolumns c2) SELECT a.DateKey AS OrderDateKey ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn1 ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn2 ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn3 INTO BigDataColumnStore FROM generator CROSS JOIN dbo.DimDate a; CREATE CLUSTERED INDEX ccsi ON BigDataColumnStore ( OrderDateKey ) CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON BigDataColumnStore WITH DROP_EXISTING
Our test query uses the DimDate table from the AdeventureWorksDW database:
SELECT d.CalendarQuarter ,SUM([BigColumn1]) ,SUM([BigColumn2]) ,SUM([BigColumn3]) FROM [dbo].[BigDataColumnStore] f JOIN [dbo].[DimDate] d ON d.DateKey = f.OrderDateKey GROUP BY d.CalendarQuarter OPTION (MAXDOP 6)
Here are the results of this test, first if the column store object pool is warmed on node 0, then the CPU affinity mask altered such that the query runs on node 1, this is what we get:
Remembering that there is an additional overhead in accessing memory on a foreign NUMA node, this implies that a significant proportion (if not all) of the pages belonging to the column store are stored in NUMA node 0’s local memory. Given free reign to schedule worker threads as it sees fit and an instance on which no other sessions are running, where will SQL OS elect to schedule the worker threads ?. With the CPU affinity mask set to its default, all the worker threads run on NUMA node 0 when the query is ran for the first time:
when the query is ran for a second time, NUMA node 1 is now used:
Every subsequent time the query is ran, SQL OS will use a different NUMA node.
I would like to thank Joe Chang for giving me a nudge in the right direction when investigating this.
Batch Hash Joins and “Cache Out”
Most people will be familiar the penalty paid when an iterator (hash join, sort etc) spills out of memory and into tempdb, “Cache out” is the generic term for spilling out of the various caches and memory as we travel from the CPU to storage:
This is the way the batch engine performs hash joins according to a reliable source:
The row store engine partitions hash tables across NUMA nodes, which I suspect to be the case the when the worker threads cannot fit onto a single NUMA node, conversely the batch engine will affinitize the hash table to a single NUMA node. If this is genuinely the case, a single hash table affinitized to a single CPU socket will cache out before the row mode equivalent partitioned across two CPU sockets does.
The size of the dimension table being joined to the column store fact can be increased up to the point that level 3 cache out takes place, causing the number of CPU cycles per row for the hash join iterator should jump up, there are several key points to take away here:
- All memory access is CPU intensive
- Accessing main memory is more CPU intensive than accessing memory pages inside the CPU cache:
Why I Like Windows Performance Toolkit So Much
SQL Server 2014 provides an extended event which allows the CPU cycles per row to be captured for the hash aggregate iterator:
This is great but I want to capture CPU utilization statistics for the straight hash join iterator also. Herein lies the problem with the standard SQL Server tooling, you are limited by what the SQL Server development team at Microsoft wish to expose. If the public symbols are available for the release of SQL Server you are using, CPU utilization across the entire database engine can be captured and explored in windows performance analyzer.
This blog post should appear in the not too distant future, I hope you have enjoyed reading this one.