The In-Memory OLTP Engine and NUMA Foreign Memory Access: Part 1

In my Saturday community day session at SQL Bits I mentioned an article by Linchi Shea from the sqlblog site in which he demonstrates that the overhead of 100% foreign memory access does not incur the drop in query throughput you might think. The article can be found here. This got me thinking, what if you were to re-produce a similar test with the in-memory OLTP engine, given that we have zero overhead for locking, latching and zero overhead for interpreting the SQL language, would the effect of remote memory access be more pronounced ?.

To recap on what NUMA is, NUMA is the abbreviation for non-uniform memory access, essentially each CPU socket has its own private memory bus to its own bank of memory. SQLOS is NUMA aware and create a memory node for each NUMA node:

NUMA architecture

The rationale for NUMA existing in the first place is that prior to this all CPU sockets shared a single memory bus, as the number of CPUs in a server increased, so did the pressure on this single bus thus impeding scalability:

WHY numa

This is my test server:


Using this I’m going to carry out three tests:

  1. Set min memory to 16GB, max memory to 28GB and the CPU affinity mask to use only the first NUMA node, as the 64GB of memory is divided evenly between sockets. My hope is that by setting max memory to a value less than half of this (less because the operating system will require some memory), the memory access for node 0 should be biassed towards local NUMA node memory access, querying sys.dm_os_memory_nodes hints that this might be the case:
    node 0 

    If I change the CPU affinity mask to only use node 1, this is what sys.dm_os_memory_nodes returns:

    NODE 1

    The key column to look at is page_kb, defined in books only from this article as:

    Specifies the amount of committed memory, in KB, which is allocated from this NUMA node by Memory Manager Page Allocator. Not nullable.

  2. As I have ten cores per socket, I will then fire off ten concurrent threads each of which will insert 250,000 rows into a memory optimized table with a hash index.
  3. Same test as per the previous one, except with the CPU affinity mask set to use only the cores from the second CPU socket.
  4. Similar test as 1. but with max memory set to 56Gb and the CPU affinity mask set to use 5 cores from each socket.

I’m going to use ostress from rml utilities to fire off ten threads and this is the code I will use:

CREATE TABLE [dbo].[MyBigTableIm] (
     [c1] [uniqueidentifier] NOT NULL DEFAULT NEWID()
    ,[c2] [datetime]         NOT NULL DEFAULT GETDATE()
    ,[c3] [char](17)         NOT NULL DEFAULT (CONVERT([char](17),rand()))
    ,[c4] [int]              NOT NULL DEFAULT (round(rand()*(1000)  ,(0)))
    ,[c5] [int]              NOT NULL DEFAULT (round(rand()*(10000) ,(0)))
    ,[c6] [bigint]           NOT NULL DEFAULT (round(rand()*(100000),(0)))

                                   ) WITH ( BUCKET_COUNT = 524288)

Some code will be required to perform the actual inserts, two procedures in fact, a ‘Driver’ procedure:

CREATE PROCEDURE [dbo].[usp_insertImMain]
 LANGUAGE                    = N'English')
    DECLARE @i INT = 0;

    WHILE @i < 250000
	EXEC dbo.usp_insertIm
	SET @i += 1;

and a procedure to perform the actual insert which is invoked by the other procedure:

CREATE PROCEDURE [dbo].[usp_insertIm]
 LANGUAGE                    = N'English')

So as to ensure that the environment is ‘Clean’ each time a test is run the database engine service is stopped and started before hand, also I ran each test twice in succession, here are the results:

im insert

Interpreting The Results

  1. My initial thoughts on this were that node swapping was at play here as stated in the original version of this post, however according to @sqlL_handle this no longer takes place. Therefore, what SQLOS sees as node 0 is the node that the operating system loads up with threads when the server is started up, this explains why the test runs faster on SQLOS node 1 than SQLOS node 0. Thanks to @sqL_handle for pointing this out:


  2. Why does the test run faster with 5 threads running on each node ( CPU socket ). The table when fully populated takes up 248MB in memory, this means that it can be easily accommodated within each bank of memory associated with each CPU socket. However its my belief that the translation look-aside buffer (TLB) plays a key part in why the third test is the fastest. The translation look-aside buffer is a page table which helps to translate the address of virtual memory pages to the addresses of physical pages. The nature of a hash index for a memory optimised table is such that rows are evenly hashed across all buckets, I suspect this leads to the TLB being thrashed when only using one socket, remember that there is one memory controller per socket and this contains the TLB. I suspect that when there are two sockets used in the test leveraging eight memory channels (four per socket) might also help:inter core

Addendum 08/05/2016

This Microsoft article gives mention to NUMA imbalance, again thanks to @sqL_handle for bringing this to my attention, it is something I need to investigate in order to determine whether this turns my original prognosis on its head. Luckily I have some ideas as to how windows performance tool kit might prove / disprove this. I have only skimmed the article, as such I am not aware what its opinion is regarding the placement of data when the CPU affinity mask is altered.



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s