There has been some recent discussion around whether the SQL Server batch engine is NUMA aware and does it support large pages, as I have been looking under the covers of this recently, I thought I would add some comments to this.
Is The Batch Engine NUMA Aware ?
I am not aware of any official Microsoft documentation that states whether the batch engine is NUMA aware or not, but I do have various stack traces at my disposal and some information from Thomas Kejser on how the batch engine performs hash joins. Whilst Thomas no longer works from Microsoft you will find his name on many articles and white papers produced from the days when he was a SQL CAT team member, firstly the stack trace:
This was captured whilst I was running a query against SQL Server 2014 CU7, note the line right at the end of the excerpt suffixed with “CheckNumaLocality”.
I would thank the first person to comment on this post (Mark) for his input, specifically that the buffer pool is aware, its a know and documented fact that there is a memory pool and clerk specifically for column store segments in the form of the column store object pool. Mark makes the comment that the segments are read in through the buffer pool, there is a line with msvcr100.dll!.memcpy on it, this seems to suggest the data is being copied elsewhere, if we continue reading up the stack the line sqlmin.dll!ColumnStoreObjectPool::ConstructObjectAndFix appears, proof positive that this somewhere is the column store object pool. The Microsoft research paper “Enhancement To SQL Server Column Store” mentions:
Column segments and dictionaries are brought into memory as needed during query processing. They are not stored in the buffer pool but in a new cache for large objects. Each object is stored contiguously on adjacent memory pages. This simplifies and speeds up scanning of a column because there are no “page breaks” to worry about.
Presumably this contiguous area is the column store object pool, unfortunately it mentions nothing about NUMA awareness. However, there are several other interesting things to note, firstly the line:
Some column store database engines, SAP HANA to name but one use “Late materialization”, more simply put they try to work with compressed data as much as possible, this document makes reference to this. Is SQL Server doing this ?, this may or may not be related to decompression of segments that have been subject to column store archive compression.
Secondly there is the line with the memcpy, which is expensive, aside from development effort and reusing the existing code base, what is the rationale for reading in column store segments via the buffer pool ?. What performance gain would we get if the column store object pool could consume column store segments directly from storage rather than having to go via the buffer pool, might future versions of SQL Server ( 2016 ? ) do this.
You would be forgiven for thinking that the segment read is performed via SQL OS, this seems like a reasonable assumption to make and I would go along with this, were it not for the fact I’ve look at the call stack and this indicates otherwise:
The call to KernelBase.dll!ReadFileScatter is invoked from sqlmin, this encapsulates the main SQL Server execution engine and the storage engine, SQL OS is encapsulated by sqldk.dll and sqlos.dll:
Is The Batch Engine NUMA Aware From Top To Bottom ?
When it comes to hash joins (I also believe this to be the case with hash aggregates), aside from the obvious fact that the batch engine is working at batch level, approximately 1000 rows per batch, it is a completely different beast to the row mode engine:
With the conventional row mode engine, hash tables are partitioned across NUMA nodes, the batch engine is a different beast in that the hash table is ‘Tied’ to one NUMA nodes. This does away with the overhead of having to re-partition data across input threads and skew speeds processing up, in that frequent values that appear in sequence in the batches used in hash probes will cause the part of the hash table the probes are hitting to rise up the CPU cache hierarchy and stay there for longer.
In summary, is the batch engine NUMA aware ? , after reading the call stack (correctly – as per Mark’s comment) I cannot conclusively prove this one way or another, however reliable sources have suggested to me that it is not NUMA aware.