SQL Server, Hyper-Threading and OLTP Workloads

This blog post will look at hyper-threading and the throughput it provides for OTLP workloads.

How Does Hyper Threading Work ?

This varies according to which generation of Intel processor you are talking about, for the purpose of this blog I will focus on Intel CPU architectures and not AMD. A Xeon class CPU has a series of instruction pipelines:

Pipelined

One of the key objectives of getting as much work out the processor as possible is to fill each slot in the pipelines and avoid what are known as “Bubbles”:

bubbles

Hyper-threading has historically had a bad reputation caused by the way it was implemented in the NetBurst architecture, this aimed to accommodate a separate thread of execution per CPU core by filling up as many available pipeline slots as possible, performance was disappointing due to the “Back end” of the execution pipelines struggling to accommodate the second thread.

Enter The “Core i” Series Architecture

Most people reading this blog post will be using Xeon processors based on the Core i series architecture (Nehalem micro-architecture onward):

Core i road map

The core architecture followed the NetBurst and omitted hyper-threading, it came back in the i series architecture both resurgent and re-designed:

CpuStall

In the current generation of Intel processors, when a CPU stall (aka last level cache miss) takes place, the pipeline slots which are empty whilst data is retrieved from main memory are used to schedule a second thread. With hyper-threading enabled SQL Server sees two logical processors per physical CPU core.

taskmgr NUMA

Time For A Test Drive

This is my test setup:

  • Dell T5500 with 2 x 6 core 2.4Ghz Westmere Xeon processors
  • 32GB 1600Mhz tripple channel ECC RAM
  • 1 x Fusion IO 2.4TB ioDrive 2 duo card.
  • 2 x SanDisk 480GB Extreme SSD
  • Windows Server 2012 R2
  • SQL Server 2014 Enterprise Edition CU 6
  • Trace flags 1117, 1118, 2330 and 8008 on
  • SQL Server configured to use large memory pages
  • Delayed durability set to ‘FORCE’.

This is how the test test data is created:

CREATE TABLE dbo.MyBigTable (
     c1 bigint NOT NULL
         CONSTRAINT DF_BigTable_c1
         CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED
    ,c2 DATETIME
         CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE()
    ,c3 CHAR (111)
         CONSTRAINT DF_BigTable_c3 DEFAULT 'a'
    ,c4 INT
         CONSTRAINT DF_BigTable_c4 DEFAULT 1
    ,c5 INT
         CONSTRAINT DF_BigTable_c5 DEFAULT 2
    ,c6 BIGINT
         CONSTRAINT DF_BigTable_c6 DEFAULT 42
);

;WITH  t4    AS (SELECT     n
                 FROM       (VALUES(0),(0),(0),(0)) t(n))
      ,t256  AS (SELECT     0    AS n
                 FROM       t4   AS a
                 CROSS JOIN t4   AS b
                 CROSS JOIN t4   AS c
                 CROSS JOIN t4   AS d)
      ,t16M  AS (SELECT ROW_NUMBER()
                        OVER (ORDER BY (a.n)) AS num
                 FROM       t256 AS a
                 CROSS JOIN t256 AS b
                 CROSS JOIN t256 AS c)
INSERT INTO dbo.MyBigTable
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM   t16M;
Using the T-SQL above results in a table which is 2.4GB in size.
This is the stored procedure used to perform inserts into MyBigTable
CREATE PROCEDURE [dbo].[usp_Insert]
    @RowsToInsert int = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE  @i    BIGINT = 0
            ,@base BIGINT = @@SPID * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1)
                VALUES  (@base + @i)
                       ,(@base + @i + 1);
        COMMIT;
        SET @i += 2;
    END;
END;
I’m going to test the effect of hyper-threading on a single socket, there are two ways of doing this:
  • Hyper-threading can be enabled / disabled in the BIOS
  • With hyper-threading enabled, if the CPU affinity mask is altered such that every even numbered logical processor is available, this has the same effect of turning hyper-threading off.

In the first test all logical CPUs will be disabled for NUMA node 1 and the first 6 logical processors will be enabled for NUMA node 0, this is the equivalent of running six schedulers on 3 physical CPU cores with hyper-threading enabled:

Aff mask ht

Running the concurrent insert procedure with 6 threads and the CPU affinity mask above results in an elapsed time of 1 minutes and 39 seconds. The CPU usage for this test when its in full flight looks like:

OLTP 6 threads 3 cores

Utilization at the logical processor level for the database engine is surprisingly even:

cpu usage

Based on bench marks performed by Hewlett Packard (nothing to do with my current work engagement with them), the second hyper-thread per core results in an extra 30% throughput for certain workloads. This is why I have conducted a test with 6 threads running on 3 physical cores and then the same again but with 1 thread per core.

Lets try the test again, but this time with the CPU affinity mask set up as follows:

Aff mask no htRunning the concurrent insert test with 6 threads completes with an elapsed time of 1 minutes and 2 seconds. This is what the CPU utilization per logical processor looks like according to task manager:

OLTP 6 threads 6 cores

There is one last test to be performed using the synthetic sequential key, that is to use all 6 physical cores on the socket with 12 threads and hyper-threading turned on, this test completes in 1 minutes and 42 seconds. Here are all the results so far summarised in a simple graph:

HT sequential key

Repeating the same tests but using a non-sequential key based on NEWID() results in the following timings:

HT random key

Hekaton

Whilst Hekaton was still fresh in my mind from the blog post on scaling singleton inserts, I tested this also. This is the T-SQL for creating and seeding the table for this test:

CREATE TABLE [dbo].[MyBigTableIm] (
     [c1] [bigint]
         NOT NULL
    ,[c2] [datetime]
         NULL CONSTRAINT [DF_BigTableIm_c2]
         DEFAULT (getdate())
    ,[c3] [char](111)
         COLLATE SQL_Latin1_General_CP1_CI_AS NULL
         CONSTRAINT [DF_BigTableIm_c3]  DEFAULT ('a')
    ,[c4] [int]
         NULL CONSTRAINT [DF_BigTableIm_c4]
         DEFAULT ((1))
    ,[c5] [int]
         NULL CONSTRAINT [DF_BigTableIm_c5]
         DEFAULT ((2))
    ,[c6] [bigint]
         NULL CONSTRAINT [DF_BigTableIm_c6]
         DEFAULT ((42)),
     PRIMARY KEY NONCLUSTERED HASH ([c1])
     WITH ( BUCKET_COUNT = 2097152)
) WITH ( MEMORY_OPTIMIZED = ON
        ,DURABILITY       = SCHEMA_AND_DATA )

This is natively compiled test procedure for performing the insert test with Hekaton:

CREATE PROCEDURE [dbo].[usp_InsertIm]
     @spid         bigint
    ,@RowsToInsert int    = 2500000
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
BEGIN ATOMIC
    WITH ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
          ,LANGUAGE                    = N'us_english')

    DECLARE  @i    INTEGER = 0
            ,@base BIGINT  = @spid * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN
            INSERT INTO dbo.MyBigTableIm (c1)
                        VALUES (@base + @i);
        END;
        SET @i += 1;
     END;
END;

hekaton

Takeaways

#1 Obtaining The Lowest Elapsed Times For Individual Threads

Individual insert threads complete in the lowest elapsed time when they have a whole CPU core to themselves, possibly due in part to the fact that that they are not contending with another thread for the CPU cache.

#2 Obtaining The Best Hardware Utilization

To get the best possible throughput out of all available CPU cores, using all cores with hyper-threading turned on yields the best results when using the conventional database engine.

#3 CPU Utilization Per Logical Processor

Despite CPU utilization obtained per logical processor using the same physical CPU core being surprisingly even, hyper-threading does not double the CPU capacity available to SQL Server, or any other application for that matter.

#4 Hekaton

With the conventional non-hekaton engine, 12 threads double the throughput at the expense of a very small increase in elapsed time, this is not the case with Hekaton and the tests performed here, it is something which requires an in depth investigation and a subject for another blog post.

2 thoughts on “SQL Server, Hyper-Threading and OLTP Workloads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s