In a session I have presented both at SQL Konferenz in Germany and the recent SQL Saturday in Portugal, I cover CPU architectures and memory at level 400. As a ‘Hook’ with which to pique the interest of the audience I present this conundrum:
Contrary to what you might expect based on conventional wisdom, the query that uses the larger column store completes in the lowest elapsed time. Despite the fact the column store scan is more expensive, the CPU savings realized from the hash aggregate that follow this (by turning random memory access hash probes into sequential memory access hash probes) result in a net CPU saving:
Using a tool by Intel called VTune, a low level CPU performance profiler, we can see the difference our hand crafted sorted hash aggregate makes to last level cache misses, or more simply put the number of times we cannot find what we want in the CPU cache:
The graph above depicts the last level cache ( LLC ) misses as we go from a degree of parallelism of 2 through to 24. however . . . Why Does The LLC Dip At DOP Of 14 Using The Non-Sorted Hash Aggregate ?
I was asked why there is the sudden jump in last level cache misses at a DOP of 10 for the query using the column store on the non pre-sorted data. I suspect this is because this is the tipping point at whether the volume of hash aggregate table probes are such that the last level cache gets swamped and ceases to be of any use in shielding the CPU cores from the latency of accessing main memory.
When I presented this session in Portugal I was asked this question and I did not have an answer at the time, but now I do. SQL OS tries to schedule all threads for a parallel query on what it deems to be the “Best available NUMA node”, which is the one with most free schedulers and memory. A degree of parallelism of 14 results in a physical CPU core on the second socket being used, 1 core and not 2 because hyper-threading is enabled. This single core has the entire last level cache to itself, as the degree of parallelism increases the CPU the last level of the cache has to be shared by all the cores on the socket. Because this one core has the whole last level of the cache to itself , we see the dip in last level cache misses. As the cores on the second NUMA node are loaded up, the last level cache miss curve ramps up because the contention on it increases. To recap in the CPU cache hierarchy, here is an image which has been ubiquitous with my blog posts most recently: