Under The Hood Of The Batch Engine: NUMA Support Part 1

There has been some recent discussion around whether the SQL Server batch engine is NUMA aware and does it support large pages, as I have been looking under the covers of this recently, I thought I would add some comments to this.

Is The Batch Engine NUMA Aware ?

I am not aware of any official Microsoft documentation that states whether the batch engine is NUMA aware or not, but I do have various stack traces at my disposal and some information from Thomas Kejser on how the batch engine performs hash joins. Whilst Thomas no longer works from Microsoft you will find his name on many articles and white papers produced from the days when he was a SQL CAT team member, firstly the stack trace:

batch engine NUMA

This was captured whilst I was running a query against SQL Server 2014 CU7, note the line right at the end of the excerpt suffixed with “CheckNumaLocality”.

Addendum
I would thank the first person to comment on this post (Mark) for his input, specifically that the buffer pool is aware, its a know and documented fact that there is a memory pool and clerk specifically for column store segments in the form of the column store object pool. Mark makes the comment that the segments are read in through the buffer pool, there is a line with msvcr100.dll!.memcpy on it, this seems to suggest the data is being copied elsewhere, if we continue reading up the stack the line sqlmin.dll!ColumnStoreObjectPool::ConstructObjectAndFix appears, proof positive that this somewhere is the column store object pool. The Microsoft research paper “Enhancement To SQL Server Column Store” mentions:

Column segments and dictionaries are brought into memory as needed during query processing. They are not stored in the buffer pool but in a new cache for large objects. Each object is stored contiguously on adjacent memory pages. This simplifies and speeds up scanning of a column because there are no “page breaks” to worry about.

Presumably this contiguous area is the column store object pool, unfortunately it mentions nothing about NUMA awareness. However, there are several other interesting things to note, firstly the line:

sqlmin!ColumnStoreLockBytesSS::ReadAtUncompressed

Some column store database engines, SAP HANA to name but one use “Late materialization”, more simply put they try to work with compressed data as much as possible, this document makes reference to this. Is SQL Server doing this ?, this may or may not be related to decompression of segments that have been subject to column store archive compression.

Secondly there is the line with the memcpy, which is expensive, aside from development effort and reusing the existing code base, what is the rationale for reading in column store segments via the buffer pool ?. What performance gain would we get if the column store object pool could consume column store segments directly from storage rather than having to go via the buffer pool, might future versions of SQL Server ( 2016 ? ) do this.

You would be forgiven for thinking that the segment read is performed via SQL OS, this seems like a reasonable assumption to make and I would go along with this, were it not for the fact I’ve look at the call stack and this indicates otherwise:

NUMA column store scatter read

The call to KernelBase.dll!ReadFileScatter is invoked from sqlmin, this encapsulates the main SQL Server execution engine and the storage engine, SQL OS is encapsulated by sqldk.dll and sqlos.dll:

dbengine

Is The Batch Engine NUMA Aware From Top To Bottom ?

When it comes to hash joins (I also believe this to be the case with hash aggregates), aside from the obvious fact that the batch engine is working at batch level, approximately 1000 rows per batch, it is a completely different beast to the row mode engine:

batch-hash

With the conventional row mode engine, hash tables are partitioned across NUMA nodes, the batch engine is a different beast in that the hash table is ‘Tied’ to one NUMA nodes. This does away with the overhead of having to re-partition data across input threads and skew speeds processing up, in that frequent values that appear in sequence in the batches used in hash probes will cause the part of the hash table the probes are hitting to rise up the CPU cache hierarchy and stay there for longer.

In summary, is the batch engine NUMA aware ? , after reading the call stack (correctly – as per Mark’s comment) I cannot conclusively prove this one way or another, however reliable sources have suggested to me that it is not NUMA aware.

5 thoughts on “Under The Hood Of The Batch Engine: NUMA Support Part 1

  1. It is my understanding that CS segments are loaded *through* the buffer pool and then decompressed into to the CS pool which is a separate memory pool (visible as a memory clerk in the DMVs). The call stack shows that the buffer pool does something NUMA aware. It does not show that the CS pool does something NUMA aware. CS allocations are not visible on this call stack.

      • I don’t have a blog and this is a throw-away identity (as always).

        Maybe you can find out where memory is allocated by executing in a loop:

        DBCC DROPCLEANBUFFERS
        –load 100MB of big, low-CPU CS data from disk (like 8000 byte distinct string values)
        select sum(data_length(someBigColumn)) from CSTable

        This should put CPU load onto the allocator and make it visible in xperf or PerfView.

      • The “Uncompressed” in the name of that one function could simply refer to archive compression. Even if archive compression is not in use that function might be called (and simply internally do not perform decompression).

  2. If you were to compare SQL 2012 and 2014 stack traces this might provide some hints as to whether the decompression relates to column store archive decompression. If the column store object pool is ** not ** NUMA aware and there is no way of telling this from the stack trace, the cost of memcpy-ing a cache entry from one NUMA node to another when copying data from the buffer pool could be up to 100 CPU cycles ( latency of the QPI ), much much more if the data is not in the CPU cahce hierarchy. Has I still access to VTune amplifier I would probably be able to tell definitely if the column store object pool is NUMA aware based on the movement of cache lines between CPU cores.

    Update
    The call to sqlmin.dll!ColumnStoreLockBytesSS::ReadAtUnCompressed still exists in a stack trace from SQL Server, because archive compression does not exist in this version, I’m inclined to think that the SQL Server batch engine does not use late materialisation . . . yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s