Under The Hood Of The Batch Engine: NUMA Support Part 2

In this post I would like to dig deeper into the NUMA awareness of the SQL Server batch engine. Firstly, can we infer anything from the SQL OS method calls with ‘NUMA’ in their name, based on this stack trace excerpt from my single socket laptop? The answer is an emphatic ‘No’:


I therefore turned my attention to different methods for proving or disproving NUMA awareness in the batch engine.

How Data Is Delivered Into The Column Store Object Pool

cstore obj pool

Data is delivered to the column store object pool via the buffer pool. memcpy, a C/C++ function, copies data from a source location in memory to a destination location, which in this case is the column store object pool.

The column store object pool stores BLOB pages contiguously as per this excerpt from the white paper “Enhancements To SQL Server Column Stores“:

Each object is stored contiguously on adjacent memory pages. This simplifies and speeds up scanning of a column because there are no “page breaks” to worry about.

This creates the best possible conditions for the CPU memory pre-fetcher to do its job, and in doing so ensures that expensive trips to main memory are minimized.

OLTP and NUMA, Locality, Locality, Locality !!!

Before I dive any deeper into my investigation as to whether or not the batch engine is NUMA aware, let’s take a step back and look at the NUMA aware behavior we want to see for OLTP applications:

Numa works

and this is what we don’t want to see:

NUMA fails

SQL OS is a completely different beast when it comes to scheduling threads for parallel queries; the CSS Engineers blog post “NUMA Connection Affinity and Parallel Queries” is recommended reading on the subject. Note this excerpt from the post:

If the DOP for the query can be satisfied by a single node the ideal node (most workers, schedulers and memory available at the time the query is started) is selected.  The controlling worker can be on a different node than the ideal.  For example you could have NUMA port affinity established for NODE 1 but the parallel query executes on NODE 4 because 4 was the ideal node.   This is not a problem because the majority of the work is done by the parallel workers which are located together on a node.

“Available memory” is referred to, but nothing is mentioned about affinity between threads and NUMA node local memory.

Knobs and Dials For Controlling and Leveraging NUMA

  • Enabling interleaved memory in the server’s BIOS turns NUMA off.
  • Some server BIOSes allow individual CPU sockets to be turned off.
  • Trace flags 839 and 8015 force SQL OS to treat the buffer pool as a single pool of memory and to ignore hardware NUMA; this is covered in this CSS engineers blog post.
  • NUMA port affinity; this is more applicable to the world of OLTP than OLAP.
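As a sketch, trace flag 8015 can be enabled globally at runtime with DBCC TRACEON (the more usual approach is to add it as a -T8015 startup parameter); test this on a non-production instance first:

```sql
-- Enable trace flag 8015 globally (-1): SQL OS ignores hardware NUMA.
DBCC TRACEON (8015, -1);

-- Verify which trace flags are currently active.
DBCC TRACESTATUS (-1);
```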

This Is All Very Good, But Is My Server NUMA Aware ?

If the option to view CPU utilization per NUMA node is not grayed out in task manager, your machine has NUMA enabled. The screen shot below was taken from my laptop, hence the option to view CPU utilization per NUMA node is disabled:

numa aware

Which Logical Processors Belong To Which NUMA Node ?

With hyper-threading disabled the relationship between physical CPU cores and logical processors is 1:1 and this is what we see in task manager:


Enabling hyper-threading results in the operating system seeing two logical processors per physical CPU core, changing what we see in task manager to:



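Task manager aside, SQL OS exposes the same mapping through its DMVs; a sketch of a query showing which schedulers, and hence logical processors, belong to which NUMA node:

```sql
-- Map NUMA nodes to schedulers and logical processors.
SELECT     osn.node_id
          ,osn.memory_node_id
          ,oss.scheduler_id
          ,oss.cpu_id
FROM       sys.dm_os_nodes      osn
JOIN       sys.dm_os_schedulers oss
ON         oss.parent_node_id = osn.node_id
WHERE      oss.[status] = 'VISIBLE ONLINE'
ORDER BY   osn.node_id, oss.scheduler_id;
```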
If we create a column store that can fit inside a single CPU socket’s bank of memory and run a simple query in which we ‘warm’ the column store object pool on one CPU socket:

Foreign Node 1

and then run the same query against the second CPU socket (using the CPU affinity mask to make each socket available/unavailable):

Foreign Node 1
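Rather than flipping the affinity mask through the GUI, this can be scripted; a sketch, with the node numbers being those of a two socket server:

```sql
-- Confine the instance's worker threads to NUMA node 0 for the warming run ...
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0;

-- ... then flip to node 1 for the second run.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 1;

-- Revert to the default (all nodes available).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```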

This will reveal whether the column store object pool stores pages in NUMA node zero’s bank of memory or whether the column store’s pages are distributed across both CPU sockets’ banks of memory. If we were investigating the buffer cache in isolation, there are a couple of performance counters which would help:

buf counters
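These Buffer Node counters can also be read from T-SQL via sys.dm_os_performance_counters; a sketch:

```sql
-- Local vs. remote node page lookups, one instance per NUMA node.
SELECT     [object_name]
          ,counter_name
          ,instance_name      -- the NUMA node number
          ,cntr_value
FROM       sys.dm_os_performance_counters
WHERE      [object_name] LIKE '%Buffer Node%'
AND        counter_name  LIKE '%node page lookups%';
```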

Unfortunately there is nothing I can see which exposes remote and local node memory access statistics for the column store object pool, at least not in SQL Server 2014.

Test Server Hardware

My test server is a Dell T5500 precision work station with:

  • Windows Server 2012 R2
  • SQL Server 2014 Enterprise Edition CU 7
  • 2 sockets, each populated with a 2.4GHz 6 core Xeon (Westmere micro-architecture)
  • 32GB of 1600MHz triple channel memory
  • 1 x Fusion-io 2.4TB ioDrive Duo
  • 2 x 480GB SanDisk Extreme solid state drives

Test Data Creation

First of all I am going to create a column store that can fit inside one bank of memory, note that each socket has 2 x 8GB DIMMs. The rationale for doing this is to work out whether the batch engine is going to attempt to fit this into a column store object pool on one bank of memory or distribute it across the two:

WITH generator AS (
    SELECT     TOP 1048576 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM       (SELECT a = 1
                FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2)
SELECT      a.DateKey                            AS OrderDateKey
           ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn1
           ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn2
           ,CAST(((id - 1) % 1048576) AS bigint) AS BigColumn3
INTO       BigDataColumnStore
FROM       generator
CROSS JOIN dbo.DimDate a;

CREATE CLUSTERED INDEX ccsi ON BigDataColumnStore ( OrderDateKey );

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON BigDataColumnStore
WITH ( DROP_EXISTING = ON );
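In the absence of column store object pool counters, one coarse way to watch where the memory lands is to snapshot per-node memory allocation before and after warming the column store; a sketch using sys.dm_os_memory_nodes (pages_kb is present from SQL Server 2012 onwards):

```sql
-- Per NUMA node memory allocated by SQL OS; run before and after
-- warming the column store object pool and compare the deltas.
SELECT     memory_node_id
          ,pages_kb
FROM       sys.dm_os_memory_nodes
ORDER BY   memory_node_id;
```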


Our test query uses the DimDate table from the AdventureWorksDW database:

SELECT    d.CalendarQuarter
FROM     [dbo].[BigDataColumnStore] f
JOIN     [dbo].[DimDate] d
ON       d.DateKey = f.OrderDateKey
GROUP BY d.CalendarQuarter

Test Results

Here are the results of the test. First, the column store object pool is warmed on node 0, then the CPU affinity mask is altered such that the query runs on node 1; this is what we get:

Node 0 first

Out of curiosity I tried this from the opposite direction, by warming the column store on node 1 and then changing the CPU affinity mask such that all the query worker threads ran on node 0:

Node 1 first

Remembering that there is an additional overhead in accessing memory on a foreign NUMA node, this implies that a significant proportion (if not all) of the pages belonging to the column store are stored in NUMA node 0’s local memory. Given free rein to schedule worker threads as it sees fit, and an instance on which no other sessions are running, where will SQL OS elect to schedule the worker threads? With the CPU affinity mask set to its default, all the worker threads run on NUMA node 0 when the query is run for the first time:


when the query is run for a second time, NUMA node 1 is now used:


Every subsequent time the query is run, SQL OS uses a different NUMA node.
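To see which node the parallel workers landed on for a given run, the tasks for a session can be mapped back to schedulers and nodes; a minimal sketch, in which the session id of the window running the test query is an assumed value you would substitute:

```sql
-- Count the tasks for a session per NUMA node while the query runs.
SELECT     os.parent_node_id
          ,COUNT(*) AS task_count
FROM       sys.dm_os_tasks      ot
JOIN       sys.dm_os_schedulers os
ON         os.scheduler_id = ot.scheduler_id
WHERE      ot.session_id = 55   -- assumed: the session running the test query
GROUP BY   os.parent_node_id;
```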

I would like to thank Joe Chang for giving me a nudge in the right direction when investigating this.

Batch Hash Joins and “Cache Out”

Most people will be familiar with the penalty paid when an iterator (hash join, sort etc.) spills out of memory and into tempdb. “Cache out” is the generic term for spilling out of the various caches and main memory as we travel from the CPU towards storage:

cache out

This is the way the batch engine performs hash joins according to a reliable source:


The row store engine partitions hash tables across NUMA nodes, which I suspect happens when the worker threads cannot fit onto a single NUMA node; conversely, the batch engine affinitizes the hash table to a single NUMA node. If this is genuinely the case, a single hash table affinitized to a single CPU socket will cache out before the row mode equivalent, partitioned across two CPU sockets, does.

The size of the dimension table being joined to the column store fact can be increased up to the point at which level 3 cache out takes place, causing the number of CPU cycles per row for the hash join iterator to jump up. There are several key points to take away here:

  • All memory access is CPU intensive
  • Accessing main memory is more CPU intensive than accessing memory pages inside the CPU cache:

context switch

Why I Like Windows Performance Toolkit So Much

SQL Server 2014 provides an extended event which allows the CPU cycles per row to be captured for the hash aggregate iterator:


This is great, but I want to capture CPU utilization statistics for the straight hash join iterator also. Herein lies the problem with the standard SQL Server tooling: you are limited by what the SQL Server development team at Microsoft wishes to expose. If the public symbols are available for the release of SQL Server you are using, CPU utilization across the entire database engine can be captured and explored in Windows Performance Analyzer.
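As a sketch, a sampled CPU profile of the database engine can be captured with xperf from the Windows Performance Toolkit and then opened in Windows Performance Analyzer; run from an elevated prompt, the symbol path and output file name below being arbitrary choices:

```
REM Point the debuggers at the Microsoft public symbol server.
set _NT_SYMBOL_PATH=srv*C:\symbols*http://msdl.microsoft.com/download/symbols

REM Start CPU sampling with call stacks.
xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile

REM ... run the test query here ...

REM Stop tracing and write out the trace file for WPA.
xperf -d cpu_profile.etl
```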

A blog post on this should appear in the not too distant future; I hope you have enjoyed reading this one.

3 thoughts on “Under The Hood Of The Batch Engine: NUMA Support Part 2”

  1. Would you agree with my conclusion that it didn’t matter much which NUMA node the data resided on? The difference seems to be at most 10%. More like 5%, because node 1 seems to always be slower. (And why is that?!)

    • From a practical point of view, NUMA awareness would only be of value if you could partition your database in such a way that it could be distributed evenly across CPU sockets and the querying workload was even across all partitions. This is the conundrum with anything which is OLAP related; in the OLTP world, where queries should be serial, the execution plan shape reasonably uniform (i.e. serial with nested loop joins), transactions short and the number of pages accessed per query low, NUMA is a good fit. NUMA node 1 might be slower because SQL OS flips NUMA nodes, i.e. it treats NUMA node 0 as 1; the rationale for this is that the OS usually loads this node up with its own threads, making it the most naturally heavily loaded node. Trace flag 8025 can be used to disable the node switching behavior as per this blog post: http://blogs.msdn.com/b/psssql/archive/2011/11/11/sql-server-clarifying-the-numa-configuration-information.aspx. In the bigger picture there are actually ways and means of achieving higher scan rates by bypassing main memory full stop. The ingestion rate for a CPU socket is around 40GB/s; by using this “other means” this can be bettered. Read about the Sandy Bridge micro-architecture for clues on how to achieve this 😉
