The Microsoft research white “Enhancements made to column stores” mentions the reduction in machine instructions per row when comparing the batch mode to row mode hash join operator:
“The reduction in CPU time for hash join is very significant. One test showed that regular row-mode hash join consumed about 600 instructions per row while the batch-mode hash join needed about 85 instructions per row and in the best case (small, dense join domain) was a low as 16 instructions per row. However, the SQL Server 2012 implementation has limitations: the hash table must fit entirely in memory and it supports only inner join.”
In his keynote at the technical launch of SQL Server 2012 in the UK, Connor Cunningham also placed an emphasis on more efficient processing by doing more with fewer machine instructions. Therefore I thought I would take a slight diversion and touch on how modern Xeon processors process instructions and then tie this into batch mode.
A brief bit of history first, all processors follow a FETCH-DECODE-EXECUTE cycle, John Von Neumann established this principle in the mid 40s. Modern processors implement the FETCH-DECODE-EXECUTE cycle by having a pre-fetch unit, a series of decoders and execution units. This functionality usually manifests in a series of processing pipelines. Sandybridge core based Xeon’s ( ordinarily ) decode four instructions per clock cycle. The metric “Clock Cycles per Instruction” embodies the efficiency of a program using a CPU, the theoretical best case being 0.25 ( four instructions per clock cycle ) on average per core for Sandybridge Xeon processors. There are a lot of SQL Server related metrics which have become DBA ‘Lore’, page life expectancy to name but one. When it comes to performance at the most fundamental level, CPI is paramount and almost “The one metric to rule them all”.
Intel provide tools for analysing where processor clock cycles are expended, in my testing I have used VTune Amplifier XE, the documentation for which mentions that a CPI of 1 is a cause for concern. What degrades the Clock Cycles per Instruction figure ?, the usual suspects for this are:
- “CPU stalls” when data is not in the on CPU cache.
- Branch miss-prediction, during the FETCH-DECODE-EXECUTE cycle, certain language constructs such as IF statements, CASE and GOTO statements ( no one should be using these in the year 2014 !!! ) cause branching in the code path. The CPU needs to guess as to which branch the code path is going to go down, getting this wrong involves flushing the instruction pipelines and reloading them with the ‘Correct’ instructions, wasting CPU cycles in the process. Branch prediction logic in the CPU and / or functionality built into the compiler can help mitigate against this. Despite the fact that with each new Intel micro architecture the branch prediction units become more sophisticated, some branch miss-prediction is still likely to take place.
The word ‘Ordinarily’ was used in relation to the four instructions processed per clock cycle because the use of special machine instructions; dubbed Single Instruction Multiple Data ( SIMD ) can help decrease the CPI figure. SIMD is a vendor generic term that refers to instructions that can process data held in multiple CPU registers simultaneously. With respect to Intel processors, these were first introduced in Pentium III processors and enhanced with “Advanced Vector Extensions” ( AVX ) introduced with the Sandybridge CPU micro architecture. SIMD instructions are leveraged via special libraries compiled into executable images, the use of assembly language instructions embedded into 3 GL code via ASM directives or via the actual compiler. The AVX extensions for the Sandybridge based Xeons allow the processing of sixteen single precision floating point instructions and eight double precision floating point operations per cycle ( per core ).
“In memory technology” has been one of the hottest topics in information technology over the last few years. Lets start by taking a look at the memory hierarchy on the actual CPU:
The on CPU memory hierarchy consists of three levels:
- Level 1
This is a private cache per core with 32K assigned to instructions and 32K to data
- Level 2
A 256k data only private per core cache.
- Level 3
For the six core Sandybridge based Xeon this is a 15M cache shared across all cores.
The general trend is that as each cache in the hierarchy gets faster ( lower in latency ) it also gets smaller. If you cast your mind back to the “Cache out curve” slide:
To put this into the context of the penalty paid in clock cycles when accessing the different caches in different manners, the following information comes from this page and relate to the Intel I7 Sandybridge-E 3960X:
How are the on CPU caches best leveraged ?, the short answer is to minimise the number of trips that have to be made to main memory due to not finding what we want in the on CPU caches:
- Use memory structures that are conducive to sequential access such as arrays, avoid linked list and hash tables. In much the same way that column store technology originated in academia, it begs the question whether someone somewhere is working on a type of hash join that is less prone to random memory access, a sorted-hash join if such a thing could exist.
- Data travels from main memory in the CPU in 64 byte cache lines, for the best possible performance align data structures on 64 byte boundaries.
- Cram as much data into the L2/3 cache as possible by using small/narrow data types and compression.
- Access memory sequentially in order to encourage prefetching.
- Avoid contention of the L2/3 cache ( difficult ) such that threads get a “Long run” at using the L2/3 cache without other threads jumping in and flushing out the data/instructions that the previously incumbent thread used, this can be achieved part by minimising context switching. Based on this I have wondered if there is a use case for Microsoft increasing the SQL OS scheduler quantum to somewhere above its current 4 ms.
- ‘Affinitize’ ( partition ) the use of L2/3 in order to minimize cache misses when threads yield use of the CPU cores.
Cache misses, branch miss-prediction and data dependencies can lead to ‘Bubbles’ in the execution pipelines, or to use a less abstract analogy, empty slots containing no instructions. These bubbles can be filled in using hyper threads, that is the scheduling of two threads simultaneously per core by using spare pipeline capacity ( ‘Bubbles’ / empty slots ) to process another thread. This leads to the operating system seeing two logical processors per physical CPU core. With regard to the on CPU memory hierarchy, the L1/2 caches end up being split equally between hyper threads. In contrast to this AMD use dedicated hardware to process threads, what AMD refer to as ‘Strong’ threads.
We have now come full circle, this is why in SQL Server 2012 onwards batch mode exists, more specifically it is why:
- The execute engine feeds column stores into the CPU cache in batch sized chunks..
- Batch mode operators exist which pass rows around in batches of 1000
( on average ).
- In the afore-mentioned Microsoft white paper “Enhancements Made To Column Stores”, a new object cache exists in SQL Server 2014:
Column segments and dictionaries are brought into memory as needed during query processing. They are not stored in the buffer pool but in a new cache for large objects. Each object is stored contiguously on adjacent memory pages.
In the next post in the series, I will look at scaling a simple fact / dimension table join which will include some analysis of the CPI as the degree of parallelism increases.