In a session I have presented at both SQL Konferenz in Germany and the recent SQL Saturday in Portugal, I cover CPU architectures and memory at level 400. As a ‘hook’ to pique the audience's interest I present this conundrum:
Contrary to what you might expect based on conventional wisdom, the query that uses the larger column store completes in the lowest elapsed time. Although the column store scan itself is more expensive, the hash aggregate that follows it turns random memory access hash probes into sequential ones, resulting in a net CPU saving:
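To make the setup concrete, here is a minimal sketch of the kind of test I am describing. The table, index, and column names (other than OrderDateKey) are illustrative, assuming an AdventureWorks-style fact table:

```sql
-- Hypothetical names throughout: dbo.FactSalesSorted is a copy of the fact
-- table whose rows are physically ordered on OrderDateKey before the
-- column store index is created.

-- Step 1: force the physical row order with a clustered row store index.
CREATE CLUSTERED INDEX ccsi_FactSalesSorted
    ON dbo.FactSalesSorted (OrderDateKey);

-- Step 2: replace it with a clustered column store index; DROP_EXISTING
-- builds the column store from the rows in their already-sorted order.
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_FactSalesSorted
    ON dbo.FactSalesSorted
    WITH (DROP_EXISTING = ON);

-- The test query: on the sorted copy, rows arrive at the hash aggregate
-- grouped by OrderDateKey, so the hash probes walk memory sequentially.
SELECT   OrderDateKey,
         SUM(SalesAmount) AS TotalSales
FROM     dbo.FactSalesSorted
GROUP BY OrderDateKey;
```

The same aggregate run against the unsorted copy gives the random-probe baseline.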
Using VTune, a low-level CPU performance profiler from Intel, we can see the difference our hand-crafted sorted hash aggregate makes to last level cache misses, or, more simply put, the number of times we cannot find what we want in the CPU cache:
The graph above depicts the last level cache (LLC) misses as we go from a degree of parallelism of 2 through to 24. However . . . Why Does The LLC Dip At A DOP Of 14 Using The Non-Sorted Hash Aggregate?
Addendum 26/05/2015
I was asked why there is a sudden jump in last level cache misses at a DOP of 10 for the query using the column store on the non-pre-sorted data. I suspect this is the tipping point at which the volume of hash aggregate table probes is such that the last level cache gets swamped and ceases to be of any use in shielding the CPU cores from the latency of accessing main memory.
When I presented this session in Portugal I was asked this question and did not have an answer at the time, but now I do. SQLOS tries to schedule all the threads of a parallel query on what it deems to be the “best available NUMA node”, which is the one with the most free schedulers and memory. A degree of parallelism of 14 results in a single physical CPU core on the second socket being used; one core and not two because hyper-threading is enabled. Because this one core has the entire last level cache on its socket to itself, we see the dip in last level cache misses. As the remaining cores on the second NUMA node are loaded up, the cache has to be shared by more and more cores, contention on it increases, and the last level cache miss curve ramps back up. To recap the CPU cache hierarchy, here is an image which has appeared in most of my recent blog posts:
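If you want to verify where SQLOS has placed the threads of a parallel query, something along these lines against the task and scheduler DMVs will show the NUMA node(s) in use. This is a sketch; run it from a second session while the parallel query is executing, and substitute the session id of the query under test for the placeholder:

```sql
-- Which NUMA node(s) are the tasks of the parallel query running on?
-- 55 is a placeholder for the session id of the query under test.
SELECT   s.parent_node_id     AS numa_node,
         COUNT(*)             AS task_count
FROM     sys.dm_os_tasks      AS t
JOIN     sys.dm_os_schedulers AS s
         ON s.scheduler_id = t.scheduler_id
WHERE    t.session_id = 55
GROUP BY s.parent_node_id;
```

At a DOP of 14 you would expect to see most tasks on one node and only a handful spilling onto the second.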
Is this speedup (from the numbers it looks like 2x) due only to cache hits, or is run length encoding coming into play? I know that column store indexes use RLE under some circumstances; I was able to produce data that shows this taking place, measured by compressed size.
Mark, the speedup is due to the probes on the hash aggregate that follows the column store scan using memory pages already in the CPU cache; this is because the data was pre-ordered on the OrderDateKey column prior to the creation of the column store index. This not only helps the pre-fetcher do its job, by presenting it with an easy-to-spot sequential memory access pattern, but certain parts of the CPU cache hierarchy also benefit from sequential access even when it is within the same memory page.
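If you want to check what the pre-sort does to the compression itself, the column store segment metadata can be inspected directly. A sketch, reusing the hypothetical table names from earlier:

```sql
-- Compare segment encoding and on-disk size between the unsorted and
-- pre-sorted copies of the fact table (hypothetical names from earlier).
SELECT   OBJECT_NAME(p.object_id)  AS table_name,
         cs.column_id,
         cs.segment_id,
         cs.encoding_type,          -- the encoding chosen for each segment
         cs.row_count,
         cs.on_disk_size            -- compare per column between the tables
FROM     sys.column_store_segments AS cs
JOIN     sys.partitions            AS p
         ON cs.hobt_id = p.hobt_id
WHERE    p.object_id IN (OBJECT_ID('dbo.FactSales'),
                         OBJECT_ID('dbo.FactSalesSorted'))
ORDER BY table_name, cs.column_id, cs.segment_id;
```

Comparing encoding_type and on_disk_size per column between the two copies should show where the pre-sort on OrderDateKey has changed how the segments compress.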