According to general SQL Server lore, hyper-threading works well for OLTP workloads and not so well for OLAP-style workloads, but does this hold true for the batch engine? There is only one way to find out. This is my test lab server:
- Dell T5500 with 2 x 6-core 2.4GHz Westmere Xeon processors
- 32GB 1600MHz triple-channel ECC RAM
- 1 x Fusion-io 2.4TB ioDrive2 Duo card
- 2 x SanDisk 480GB Extreme SSD
- Windows Server 2012 R2
- SQL Server 2014 Enterprise Edition CU 6
- Trace flags 1117, 1118, 2330 and 8008 on
- SQL Server configured to use large memory pages
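A minimal sketch for verifying this configuration (the exact startup parameters, e.g. -T1117 -T1118 -T2330 -T8008, and the use of trace flag 834 for large pages, are assumptions on my part and not something spelled out above):

```sql
-- Verify which of the trace flags are active globally.
DBCC TRACESTATUS(1117, 1118, 2330, 8008, -1);

-- Large memory pages are typically enabled via trace flag 834 plus the
-- Lock Pages in Memory privilege; confirm what was actually allocated with:
SELECT  large_page_allocations_kb
       ,locked_page_allocations_kb
FROM    sys.dm_os_process_memory;
```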
The test data is an inflated version of the FactInternetSales table created using this statement:
```sql
WITH generator AS (
    SELECT TOP 500000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM   (SELECT a = 1 FROM master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT  a.DateKey AS OrderDateKey
       ,CAST(((id - 1) % 1048576) AS money) AS Price1
       ,CAST(((id - 1) % 1048576) AS money) AS Price2
       ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO    FactInternetSalesBig
FROM    generator
CROSS JOIN dbo.DimDate a
```
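A minimal sketch of what the clustered column store index creation might look like (the index name here is purely illustrative):

```sql
-- Hypothetical index name; all that matters for the tests is that a
-- clustered columnstore index exists on FactInternetSalesBig.
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_FactInternetSalesBig
ON dbo.FactInternetSalesBig;
```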
This is the test query that will use the table once the clustered column store index has been created on FactInternetSalesBig:
```sql
SELECT   d.CalendarQuarter
        ,SUM([Price1])
        ,SUM([Price2])
        ,SUM([Price3])
FROM     [dbo].[FactInternetSalesBig] f
JOIN     [dbo].[DimDate] d
ON       d.DateKey = f.OrderDateKey
GROUP BY d.CalendarQuarter
```
This is the execution plan for the query:
Let's take a look at how this query scales as the degree of parallelism is increased, with and without hyper-threading on:
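Each data point in these scaling tests pins the degree of parallelism with a query hint; a minimal sketch of how a single data point might be collected (DOP 8 is purely illustrative, and the exact test harness is an assumption):

```sql
SET STATISTICS TIME ON;

SELECT   d.CalendarQuarter
        ,SUM([Price1])
        ,SUM([Price2])
        ,SUM([Price3])
FROM     [dbo].[FactInternetSalesBig] f
JOIN     [dbo].[DimDate] d
ON       d.DateKey = f.OrderDateKey
GROUP BY d.CalendarQuarter
OPTION   (MAXDOP 8);   -- swept across a range of DOP values for each run

SET STATISTICS TIME OFF;
```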
Disabling hyper-threading results in the lowest elapsed times up to a degree of parallelism of 16.
Takeaway #1: disabling hyper-threading results in the best batch mode performance under normal conditions.
When it comes to understanding how the database engine interacts with the CPU at an intimate level, I am knowledgeable; both Thomas Kejser and Joe Chang are experts in this field. I thank Joe for providing me with a general steer in this area and Thomas for the following observation:
Hyper-threading basically allows you to "hide" memory latency if you do well with the programming. In essence, while one hyper-thread waits for a memory stall, the other can get work done. If we now assert that something like this is going on in the join loop:

```c
while (rows < BIG_NUMBER) {
    /*
     * Incredibly fast for most cases, should disappear
     * in the noise.
     */
    calculateHash()

    /*
     * Hash lookup causes a memory stall.
     */
    lookupHashValue()

    /*
     * Can be hidden by the pipeline in most cases.
     */
    writeValue()

    rows++
}
```

Because hash tables have very poor cache locality, you are almost guaranteed to miss a cache line (at least in L1, maybe in L3) when you call lookupHashValue unless you are very clever.

Case 1: HT on, adjacent cache line pre-fetch ON
- Each HT does the above loop
- When they miss, they fetch 2 cache lines (because prefetch is on)
- They consume one of them
- It is unlikely that the second one will be useful, and it may be evicted from cache before it is used

Case 2: HT on, pre-fetch OFF
- Each HT does the above loop
- When they miss, they fetch 1 cache line (saving memory bandwidth)
- While the HT is stalled for the fetch, the other HT does work
- The thread consumes the fetched line
- Result: higher cache bandwidth, because you are not abusing the memory bus fetching stuff you don't need
Case 3: HT off, pre-fetch ON
- One thread per core does the above loop
- When it misses, it fetches 2 lines
- It consumes one of them
- Cache pressure (on all caches, all the way out to L3) is now much smaller; you essentially have twice the cache space per thread
- Result: higher cache locality, because the chance of hitting an entry in the hash table you already pre-fetched is larger
Predictions
- This effect should largely become invisible if you don't join (and don't aggregate, because that is hashing too)
- You should be able to move the ratio around between the different scenarios by varying the size of the hash (= dimension) table. As you increase the hash table, {HT on, pre-fetch OFF} should become better and better compared to the other scenarios
- Once you reach a certain size of hash table (= sufficiently large) the effect should disappear, because you are now thrashing the TLB (unless radix-hashing has been implemented)

Pre-fetching is poor if you do a lot of hashing on big hash tables, unless you are very clever about the way you probe the hash buckets.
By their very nature, hash probes result in random memory access:
Creating a conventional clustered index on the FactInternetSalesBig table, ordered on the OrderDateKey column, followed by a clustered column store index created using the WITH DROP_EXISTING clause, makes the hash aggregate's hash table probes sequential in nature:
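A minimal sketch of the two statements this involves (the index names are illustrative):

```sql
-- Step 1 (assumed naming): impose the sort order by building a
-- conventional clustered index on OrderDateKey.
CREATE CLUSTERED INDEX ci_FactInternetSalesBig
ON dbo.FactInternetSalesBig (OrderDateKey);

-- Step 2: replace it with a clustered columnstore index; DROP_EXISTING
-- requires the same index name and builds the column store from the
-- now pre-sorted rowstore.
CREATE CLUSTERED COLUMNSTORE INDEX ci_FactInternetSalesBig
ON dbo.FactInternetSalesBig
WITH (DROP_EXISTING = ON);
```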
Repeating the test approximately halves the elapsed time for each degree of parallelism:
The difference that pre-sorting the data on OrderDateKey makes to the hash aggregate can be quantified (DOP of 24 and hyper-threading enabled):
- 282,160 ms (sampled CPU time) without pre-sorting the data
- 91,517 ms (sampled CPU time) with pre-sorted data
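One way to get a comparable per-operator view from inside SQL Server 2014, assuming SET STATISTICS PROFILE ON is enabled in the session running the query, is the sys.dm_exec_query_profiles DMV; this is only a sketch and not necessarily how the sampled CPU time figures above were captured:

```sql
-- Run from a second session while the test query is executing.
SELECT  node_id
       ,physical_operator_name
       ,row_count
       ,cpu_time_ms
       ,elapsed_time_ms
FROM    sys.dm_exec_query_profiles
WHERE   session_id = 55      -- session id of the test query (illustrative)
ORDER BY node_id;
```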
Takeaway #2: when the column store is created on pre-sorted data, disabling hyper-threading still results in the lowest elapsed time.
Batch mode is far superior to row mode in terms of performance because batches, which are roughly 1,000 rows in size as a rule of thumb, are designed to fit into a CPU's level 2 cache, and because column stores are held contiguously in the column store object pool, which helps the CPU memory pre-fetcher do its job.
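As a rough back-of-the-envelope illustration (the batch size and per-row footprint are assumptions, and level 2 cache sizes vary by processor): a batch of roughly 1,000 rows carrying this query's 4-byte OrderDateKey and three 8-byte money columns occupies in the region of 1,000 × (4 + 3 × 8) = 28,000 bytes, around 27KB, which fits comfortably inside the 256KB level 2 cache of a Westmere core.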
Overcoming The Memory Wall
This graph depicts Moore’s law:
No other hardware component, be it the IO sub-system or main memory, has kept pace with this curve, which gives rise to this disparity in latency between accessing main memory and the CPU cache hierarchy:
This is what the CPU cache hierarchy looks like on a Core i-series Xeon:
The "memory wall" is a performance wall that CPUs hit because main memory cannot keep up, and it is why the CPU cache hierarchy exists in the first place. The memory pre-fetcher scans the access patterns of data and instructions retrieved into the level 2 cache and tries to anticipate what is required next by pre-fetching the relevant data and instructions. More simply put, the job of the memory pre-fetcher is to hide the latency of main memory access.
The term "locality of reference" is ubiquitous in any discussion around caching; it refers to the frequency at which the same value or location is accessed, of which there are two basic types:
- Spatial locality: the likelihood that adjacent memory locations will be accessed.
- Temporal locality: the likelihood that when a particular memory location is accessed, it will be accessed again in the very near future.
For anyone interested in further reading on the subject, I recommend this article, from which I will pick out two things; firstly, this excerpt:
There’s an old rule of thumb that we add roughly one level of cache every 10 years, and it appears to be holding true into the modern era — Intel’s Haswell and Broadwell chips offer certain SKUs with an enormous L4, thereby continuing the trend.
Secondly, I will borrow this cache-out curve graph from AnandTech, which the article refers to. It illustrates the difference made by the level 4 cache, available with some processor variants that use the Haswell micro-architecture. However, as each cache in the hierarchy is dropped out of, throughput decreases. The red line represents the CPU with the level 4 cache:
There is something waiting in the wings called stacked memory, for which there are two competing standards:
- High Bandwidth Memory: backed by AMD and Hynix
- Hybrid Memory Cube: backed by a consortium including Intel and Micron
Both technologies promise lower latency and the ability to deliver data to the CPU at or above the rate of the level 3 cache, approximately 300GB/s. In the here and now, leveraging the pre-fetcher is the best option. Most BIOSes have two settings for controlling the pre-fetcher:
- Hardware pre-fetcher (on/off): this enables pre-fetching behavior, full stop; adjacent cache line pre-fetching is an extension of it.
- Adjacent cache line pre-fetcher (on/off): when this is enabled, the pre-fetcher will fetch not only the requested cache line but also the subsequent one. Remember that a cache line is the unit of transfer between main memory and the CPU; it is usually 64 bytes in size on modern Intel processors.
What Effect Do Different Pre-Fetcher Configurations Have On Hyper-Threading?
Using the test query and the column store created on the pre-sorted data, let's run the tests with hyper-threading enabled:
This is the graph for the same set of tests with hyper-threading disabled:
I was expecting performance to drop through the floor with pre-fetching disabled; this is patently not the case, and it makes me suspect that the batch engine implements its own software-based pre-fetching mechanism.
With hyper-threading enabled, the best-case scenario is a degree of parallelism of 24 with adjacent cache line pre-fetching disabled, resulting in an elapsed time of 3,489 ms. With hyper-threading disabled, the lowest elapsed time is 3,866 ms, obtained with both the hardware and adjacent cache line pre-fetchers enabled and a degree of parallelism of 20.
Why Does Disabling Adjacent Cache Line Pre-Fetching Make A Difference When Hyper-Threading Is Enabled?
In essence, turning off adjacent cache line pre-fetching when hyper-threading is enabled relieves pressure on the memory bus, because each hyper-thread pre-fetches only a single 64-byte cache line at a time instead of two.
I will wrap this post up with the final takeaway point:
Takeaway #3: with adjacent cache line pre-fetching turned off, the batch engine with hyper-threading enabled edges out the same configuration with hyper-threading disabled.
Very interesting, as I am currently evaluating a new server for max performance in batch mode.
However, I am unsure how you could manage DOP 24 with hyper-threading disabled on a 2-socket server with 6 cores/socket. I would have thought that the maximum possible would be DOP 12 with HT disabled and DOP 24 with HT enabled. Am I missing something!?
Six cores and two sockets with HT enabled will give you a maxdop of 24, as you stated, and 12 with HT disabled. However, you will find that the batch engine does not scale that well beyond a handful of cores. Note that in my tests I took just about the simplest fact-to-dimension-table join possible; your mileage will depend on query plan shape, in particular things which are always row-by-row in nature such as parallel exchange iterators, segment elimination (both in terms of pre-sorting data to encourage it and the fact that you cannot get it with certain data types) and segment trimming, to name but a few factors. If my post gave the impression you could get a maxdop of 24 with HT disabled, this was incorrect.
Thanks for coming back Chris, your SQLBits presentations were excellent!
This is the first time I have heard that the batch engine doesn't scale well beyond a handful of cores. In fact, I'd expect it to scale very well on large datasets shifting hundreds or thousands of segments. Does your experience indicate otherwise?
With SQL2016 supporting batch mode on more operators, I suspect that optimising hardware for batch mode may be worth the effort.
On a 2-socket server with 12 cores per socket (aiming for the highest clock speed on 24 physical cores right now), what kind of general DOP setting would you guess batch mode would feel happier with? I'm guessing 6 or 8.
I would say something around a DOP of 6-8 is the sweet spot for batch mode. As to your surprise that batch mode does not scale well: it is fantastically fast when you hit its sweet spot. However, you need to consider that hash joins and aggregates cannot keep up with a column store scan when most of the data is in the column store object pool and/or you are dealing with high-performance storage. Also, consider probing a hash table: although there are batch hash join and aggregate operators, probing a hash table is in reality a row-by-row operation, and the same goes for parallel exchange iterators. In fact, the batch engine will throttle back the performance of column store scans so that hash joins and aggregates upstream can keep up. The way that hash tables are partitioned, or not, as is the case with the batch engine, also adds nuances to the way it scales.
That is exactly why I'll have to time important queries with different DOP settings and HT on/off. I'll make a note here of the results, since there are not many googlable locations for hyper-threading + batch mode.