I will start by doing something which might sound strange on face value, and that is to apologise. I appreciate that I have left my series on super scaling the parallel insert in mid-air and I have every intention of finishing this off. However, commitments to present at SQL Bits and SQL Konferenz in Germany have diverted my attention elsewhere. On a more positive note I have just completed a considerable sized deck entitled “SQL Server Engine: CPU Cache as The New Ram”. As I have done before, I’m going to take some slides from this, embed them in this blog post and provide a commentary them. I should mention that this does not include Hekaton for the simple fact that Hekaton is a beast of main memory and you can’t fit much as whole tables go in a L3 cache, i.e. memory optimised tables have to fit into memory period.
In this post I will cover three things:
1. Last Level Cache Hits and what leveraging the CPU cache is all about.
2. How large memory pages work.
3. How you can relieve pressure on the LOGCACHE_ACCESS spinlock, which is the last frontier when it comes to optimising a workload which requires logging, OLTP workloads being the natural fit for this.
First of all what is the world of “CPU cache as the new RAM” all about, it’s about minimising trips out to main memory, this relates to two things:
- virtual memory mappings
decoded CPU instructions should always live in the CPU cache hierarchy. The red bar in the chart below represents the penalty we pay for making trips out to main memory. The 167 CPU cycle figure is a conservative figure in that any trips to main memory for reading the page table are not included within it.
Main memory has not kept up with the CPU, there ARE solutions to this, some of them involve software others hardware:
The trend with CPU design has been for caches to get larger, intel has developed a L4 cache called “Crystall well”, this is 128Mb of ram that sits on the same package ( not die ) as the CPU. DDR4 memory will soon become the defacto memory standard. The second generation i-Series architecture ‘Sandybridge’ introduces something interesting whereby IO devices that use the PCI bus can by pass main memory and ‘Poke’ data directly into a CPUs L3 cache:
One of the main drivers behind this technology was gigabit ethernet, specifically a wish to reduce latency and power consumption when the NIC is talking to a CPU. I believe that Data Direct IO technology can also be used by some quad port host bus adapters.
The graph below is based on information from an Intel white paper on Data Direct IO:
Finally there are two competing consortiums behind stacked memory cube technology, one includes a group including Intel and Micron, the other includes AMD and Hynix. Memory cube technology can deliver the same bandwidth to the CPU that the L3 cache can. It is the leveraging of the prefetcher that I will focus on. Put simply the prefetcher scans memory accesses and looks for sequential scan patterns that would benefit from read aheads, that is the prefetching of data from main memory into the CPU cache hierarchy. The end game of the prefetcher is to preëmpt what data a process or thread requires next such that it can find it in the CPU cache and not make a trip out to main memory.
I have blogged about a scenario where it is possible to create two column stores on the same data, one is created on a pre sorted version of the data resulting in a larger column store than the other one created on the non sorted data. When the larger column store joins to a dimension table in a very simple OLAP style query, this results in better performance. The catch being that the column store created on the pre sorted data is twice the size of the one create on the non sorted data, this is the takeaway from this whole exercise:
The performance comes from the column store created on data pre sorted on the join key resulting in sequential memory access when probing the hash aggregate hash table immediately after the column store scan. This results in the graph above, if the random memory probes on hash aggregate hash table are reduced ( or join for that matter ), this outweighs the extra cost of scanning in the larger column store and then some. The most critical thing to take away from this is the CPU stalls or last level ( L3 ) cache misses for the two different column stores.
A similar principle of paying a penalty for going to main memory applies to virtual memory mapping information, the look aside buffer ( TLB ) caches logical to physical memory mapping information within the memory controller on the CPU die. There is a significant difference in CPU cycles when accessing the TLB compared to the virtual memory table in main memory:
In the core i series Xeons ( Nehalem and later ) the TLB is hierarchical, the latency when accessing this is of the order of 10s of CPU cycles, the latency when accessing the in-memory page table is around 167 CPU cycles or more. The size of the TLB is determined by the CPU micro architecture of the CPU in question, for the first and second generation core i series architectures it looks like this:
Tools such as Intel’s VTune amplifier by allow statistics on the access to the different levels within the TLB hierarchy to be collected.
If we enable the use of large memory pages, the logical to physical memory mapping area covered increases significantly:
The upshot of this is that by enabling large memory page support, the likelihood in having to trawl the virtual page table in main memory is reduced. What impact does this have on performance, for the scan of a 118Gb table on an instance with a cold buffer cache, max memory setting of 40Gb and one socket with six cores; we get a 28% reduction in elapsed time, page lookups go from 222220 a second to 263766 a second:
It should also be noted that this helps the prefetcher in that the prefetcher will always scan up to page boundaries, also the larger the pages are, the less memory fragmentation there is likely to be, resulting in more sequential memory scans. As is often the case in life; there is no such thing as a free lunch, the down side to this is that it can increase node failover times in a SQL Server failover cluster, in that when an instance starts up on a new node after failover, there is an added overhead in finding portions of memory contiguous enough to allocate 2Mb pages from.
No in-depth discussion about memory would be complete without mentioning the unit of transfer between main memory and the CPU, this is the cache line, I don’t think I need to add much to what is already on the slide other than to talk about associativity:
Associativity refers to the number of cache locations a cache line can be stored in, in a cache with two-way associativity this is two, four for four-way etc. This leads on to spinlocks, there is one particular spinlock I’m interested in, this is the LOGCACHE_ACCESS spinlock, used by threads when reserving space in the log buffer before copying logging information into it. This is a detailed overview of how this works:
If you are using Hekaton, the LOGCACHE_BUFFER spinlock is still used. The number of solutions to specific problems is often deterministic, therefore it should not come as a surprise that Oracle has a similar mechanism for serialising space reservation requests in its log buffer, this is called the redo allocation latch.
The LOGCACHE_ACCESS spinlock cache line is associated with the log writer thread which is always allocated to core 0. In order for a thread to obtain the spinlock, the cache line is released by the log writer thread, once the threads that wants to reserve space in the log buffer obtains the cache line, it writes 16 bytes of information to it.
In previous blog posts I have largely talked about CPU cache access latencies, there is however a thing which is the great white shark of performance killers; the context switch. A context switch can take up to 2000 CPU cycles to complete. A SQL OS scheduler softens the impact of context switches:
To complete the circle, spinlocks exist to mitigate against the overhead of context switches, it is more efficient in terms of wasted CPU cycles for a thread to perform a busy wait ( spin ) whilst waiting for a lock, than it is for the thread to be context switched out of the CPU be it physical or logical (SQL OS scheduler) and switched back in once the lock becomes available.
Back to the LOGCACHE_ACCESS spinlock, conceptually this is a very high level overview of the problem:
By removing core 0 from the CPU affinity mask the rest of database engine can be segregated from the log writer thread.
The rationale behind this is that if the log writer thread contends with as few other threads as possible on core 0, there is likely to be no 2000 CPU cycle overheads before the LOGCACHE_ACCESS spinlock cache line can be released, what we have is the perfect storm for releasing the spinlock as fast as possible.
I created a simple test to see what effect this has on the LOGCACHE_ACCESS spinlock, osstress from RML utilities was used to fire off 22 concurrent sessions against the database engine, 22 because with hyper threading enabled, 11 logical processors and hence schedulers are available after the removal of core 0 from the affinity mask, giving one thread per scheduler. Here are the results:
The result is an order of magnitude reduction in spins.
Again my apologies if my flow of postings comes across as being a bit disjointed, however I hope you have enjoyed reading this.