This blog post looks at hyper-threading and the throughput it provides for OLTP workloads.
How Does Hyper-Threading Work?
This varies according to which generation of processor you are talking about; for the purposes of this blog I will focus on Intel CPU architectures rather than AMD. A Xeon-class CPU has a series of instruction pipelines:
One of the keys to getting as much work out of the processor as possible is to fill each slot in the pipelines and avoid what are known as “bubbles”:
Hyper-threading has historically had a bad reputation because of the way it was implemented in the NetBurst architecture. That implementation aimed to accommodate a separate thread of execution per CPU core by filling up as many available pipeline slots as possible, but performance was disappointing because the “back end” of the execution pipelines struggled to accommodate the second thread.
Enter The “Core i” Series Architecture
Most people reading this blog post will be using Xeon processors based on the “Core i” series architecture (the Nehalem micro-architecture onwards):
The Core architecture followed NetBurst and omitted hyper-threading; it came back in the “Core i” series architecture both resurgent and redesigned:
In the current generation of Intel processors, when a CPU stall takes place (on a last-level cache miss, for example), the pipeline slots that would otherwise sit empty whilst data is retrieved from main memory are used to schedule a second thread. With hyper-threading enabled, SQL Server sees two logical processors per physical CPU core.
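A quick way to see how the database engine maps schedulers onto those logical processors is to query the schedulers DMV; a minimal sketch (with hyper-threading on, expect two visible online schedulers per physical core):

```sql
-- One row per scheduler; cpu_id shows which logical processor
-- each scheduler is bound to, parent_node_id the NUMA node.
SELECT  parent_node_id AS numa_node
       ,scheduler_id
       ,cpu_id
       ,[status]
       ,is_online
FROM    sys.dm_os_schedulers
WHERE   [status] LIKE 'VISIBLE%'
ORDER BY scheduler_id;
```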
Time For A Test Drive
This is my test setup:
- Dell T5500 with 2 x 6-core 2.4GHz Westmere Xeon processors
- 32GB 1600MHz triple-channel ECC RAM
- 1 x Fusion-io 2.4TB ioDrive 2 Duo card
- 2 x SanDisk 480GB Extreme SSDs
- Windows Server 2012 R2
- SQL Server 2014 Enterprise Edition CU 6
- Trace flags 1117, 1118, 2330 and 8008 turned on
- SQL Server configured to use large memory pages
- Delayed durability set to FORCED
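For reference, delayed durability is a database-level setting; a sketch of how the test database might be configured, assuming a database named BigTableTest (the actual database name is not given in this post):

```sql
-- Trade full log durability for throughput: commits return without
-- waiting for the log block to be hardened to disk.
ALTER DATABASE BigTableTest
SET DELAYED_DURABILITY = FORCED;
```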
This is how the test data is created:
```sql
CREATE TABLE dbo.MyBigTable (
     c1 BIGINT NOT NULL CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED
    ,c2 DATETIME CONSTRAINT DF_BigTable_c2 DEFAULT GETDATE()
    ,c3 CHAR(111) CONSTRAINT DF_BigTable_c3 DEFAULT 'a'
    ,c4 INT CONSTRAINT DF_BigTable_c4 DEFAULT 1
    ,c5 INT CONSTRAINT DF_BigTable_c5 DEFAULT 2
    ,c6 BIGINT CONSTRAINT DF_BigTable_c6 DEFAULT 42
);

;WITH t4 AS (
    SELECT n FROM (VALUES (0),(0),(0),(0)) t(n)
)
,t256 AS (
    SELECT 0 AS n
    FROM   t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d
)
,t16M AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num
    FROM   t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c
)
INSERT INTO dbo.MyBigTable
SELECT  CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)
       ,GETDATE()
       ,'a'
       ,1
       ,2
       ,42
FROM t16M;
```
```sql
CREATE PROCEDURE [dbo].[usp_Insert]
    @RowsToInsert INT = 2000000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE  @i    BIGINT = 0
            ,@base BIGINT = @@SPID * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1)
            VALUES  (@base + @i)
                   ,(@base + @i + 1);
        COMMIT;
        SET @i += 2;
    END;
END;
```
- Hyper-threading can be enabled / disabled in the BIOS
- With hyper-threading enabled, if the CPU affinity mask is altered so that only one logical processor per physical core is available (every even-numbered logical processor, say), this has much the same effect as turning hyper-threading off.
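The affinity mask can be set through the UI or with T-SQL; a sketch, assuming the first six logical processors (0 to 5) are the ones to keep online:

```sql
-- Restrict the engine to logical processors 0-5; the schedulers
-- bound to the remaining logical processors go offline.
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 0 TO 5;
```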
In the first test, all logical CPUs are disabled for NUMA node 1 and the first six logical processors are enabled for NUMA node 0; this is the equivalent of running six schedulers on three physical CPU cores with hyper-threading enabled:
Running the concurrent insert procedure with 6 threads and the CPU affinity mask above results in an elapsed time of 1 minute and 39 seconds. This is what the CPU usage for this test looks like when it is in full flight:
Utilization at the logical processor level for the database engine is surprisingly even:
Based on benchmarks performed by Hewlett Packard (nothing to do with my current work engagement with them), the second hyper-thread per core provides an extra 30% throughput for certain workloads. This is why I have conducted one test with 6 threads running on 3 physical cores, and the same test again with 1 thread per core.
Let's try the test again, but this time with the CPU affinity mask set up as follows:
Running the concurrent insert test with 6 threads completes with an elapsed time of 1 minute and 2 seconds. This is what the CPU utilization per logical processor looks like according to Task Manager:
There is one last test to be performed using the synthetic sequential key: using all 6 physical cores on the socket with 12 threads and hyper-threading turned on. This test completes in 1 minute and 42 seconds. Here are all the results so far, summarised in a simple graph:
Repeating the same tests but using a non-sequential key based on NEWID() results in the following timings:
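The post does not show the modified insert procedure for the non-sequential key tests; here is a hypothetical sketch of what it might look like. I reuse the CRYPT_GEN_RANDOM conversion from the seeding script to generate a random BIGINT key; the post's actual NEWID()-based key derivation, and the procedure name, are assumptions:

```sql
-- Hypothetical variant of usp_Insert for the non-sequential key test:
-- each row gets a random BIGINT key, so inserts scatter across the
-- b-tree instead of contending for its right-hand edge.
CREATE PROCEDURE dbo.usp_InsertRandom
    @RowsToInsert INT = 2000000
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @i BIGINT = 0;

    WHILE @i < @RowsToInsert
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO dbo.MyBigTable (c1)
            VALUES  (CAST(CRYPT_GEN_RANDOM(8) AS BIGINT))
                   ,(CAST(CRYPT_GEN_RANDOM(8) AS BIGINT));
        COMMIT;
        SET @i += 2;
    END;
END;
```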
Whilst Hekaton was still fresh in my mind from the blog post on scaling singleton inserts, I tested this also. This is the T-SQL for creating and seeding the table for this test:
```sql
CREATE TABLE [dbo].[MyBigTableIm] (
     [c1] [bigint] NOT NULL
    ,[c2] [datetime] NULL CONSTRAINT [DF_BigTableIm_c2] DEFAULT (GETDATE())
    ,[c3] [char](111) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
         CONSTRAINT [DF_BigTableIm_c3] DEFAULT ('a')
    ,[c4] [int] NULL CONSTRAINT [DF_BigTableIm_c4] DEFAULT ((1))
    ,[c5] [int] NULL CONSTRAINT [DF_BigTableIm_c5] DEFAULT ((2))
    ,[c6] [bigint] NULL CONSTRAINT [DF_BigTableIm_c6] DEFAULT ((42))
    ,PRIMARY KEY NONCLUSTERED HASH ([c1]) WITH (BUCKET_COUNT = 2097152)
) WITH (
     MEMORY_OPTIMIZED = ON
    ,DURABILITY = SCHEMA_AND_DATA
);
```
This is the natively compiled procedure used for performing the insert test with Hekaton:
```sql
CREATE PROCEDURE [dbo].[usp_InsertIm]
     @spid         BIGINT
    ,@RowsToInsert INT = 2500000
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (
     TRANSACTION ISOLATION LEVEL = SNAPSHOT
    ,LANGUAGE = N'us_english'
)
    DECLARE  @i    INT    = 0
            ,@base BIGINT = @spid * 10000000000;

    WHILE @i < @RowsToInsert
    BEGIN
        INSERT INTO dbo.MyBigTableIm (c1)
        VALUES (@base + @i);

        SET @i += 1;
    END;
END;
```
#1 Obtaining The Lowest Elapsed Times For Individual Threads
Individual insert threads complete in the lowest elapsed time when they have a whole CPU core to themselves, possibly due in part to the fact that they are not contending with another thread for the CPU cache.
#2 Obtaining The Best Hardware Utilization
To get the best possible throughput out of the hardware, using all available CPU cores with hyper-threading turned on yields the best results with the conventional database engine.
#3 CPU Utilization Per Logical Processor
Although CPU utilization per logical processor on the same physical CPU core is surprisingly even, hyper-threading does not double the CPU capacity available to SQL Server, or to any other application for that matter.
With the conventional (non-Hekaton) engine, 12 threads double the throughput at the expense of a very small increase in elapsed time. This was not the case with Hekaton in the tests performed here; that requires an in-depth investigation and is a subject for another blog post.