In this blog post I wanted to distill down the most fundamental points to consider when attempting to process a SQL Server workload in a scale-able manner. However, many of the principles I will outline can be applied to any kind of data processing platform.
Shared State And Multi-Threaded Workloads
Shared state is the most fundamental bottleneck when attempting to scale any multi-threaded workload. More simply put, threads that involve the processing of share stated have to access the shared state via some form of synchronization mechanism. Examples of this in the world of the database engine are legion:
- The last page problem encountered by concurrent insert into heaps having indexes and clustered indexes with monotonically increasing keys.
- The parallel page enumerator encountered when attempting to scan the same non-partitioned object in parallel.
- The LOGCACHE_ACCESS spin lock that synchronizes access to the log buffer.
- The space management bitmaps from which the best practice of avoiding one data file for tempdb arises from.
- The re partition streams iterator in execution plans.
- etc etc etc
This can be summarised in this one single graphic:
As “A picture is worth a thousand” words, this conceptual image of the re-partition streams iterator illustrates how it relates to the shared state problem:
The Final Hardware Bottleneck Is Modern CPU Architecture
Joel Spolsky co-founder of stack exchange coined the term “The law of leaky abstractions”, enumerated in this blog post. The premise of this, to quote Joel is:
All non-trivial abstractions, to some degree, are leaky.
In the context of the modern CPU this means when we push the database engine hard, nuances of CPU behaviour “Stick their head” through the abstraction layer of SQL OS and become apparent in the way that the workload behaves. The modern CPU is composed of two distinct areas:
- CPU Cores – the fundamental processing building blocks
- ‘Uncore’ – this provides support services to the cores, such as the level 3 cache, main memory management unit, PCIe controllers, quick path interconnect controllers and more.
The facets of the CPU this applies to are:
- The memory on the CPU
Or the CPU cache hierarchy to use its more grandiose name, specifically the fact that when workloads spill out of this, a huge performance penalty is paid. For throughput this gives rise to the term “Cache out curve” of which there are many:
All execution plans iterators that require memory grants have two fundamental code paths, one path for when the memory grant is blown and memory spills out into tempdb and one for when the memory grant is correct or under-estimated. Perhaps the database engine team may at some point include a third option, which is for when the grant can be accommodated inside the CPU cache.
As an example, if you run a log record generation intensive workload on the same CPU socket as the log writer, usually socket 0, this will run in a shorter time compared to running the exact same workload in a different socket:
The reason for this is because acquiring key spin locks associated with the log writer involves burning more CPU cycles when transferring memory cache lines from socket to socket compared to CPU core to CPU core when everything runs on the same socket.
For illustrative purposes this image denotes the performance penalties that are paid for going “Off socket”:
- The CPU Pre-Fetcher
This entity monitors main memory access with the aim of detecting main memory access patterns in order to pre-fetch what it thinks threads running on the CPU cores require next. The ultimate aim of the pre-fetcher is to minimise costly trips out to main memory. To illustrate this, consider the example below:
Contrary to conventional wisdom, the test using the larger column store on the right runs faster than the test using the smaller column store on the left. This is because the act of pre-sorting the data leads to sequential memory access when probing the hash aggregate hash table of the iterator that immediately follows the column store scan. Sequential memory access creates the perfect conditions for the pre-fetcher to do its job.
Someone I know met one of the original DB2 optimizer architects at IBM. To put this into context how complex an optimizer is, there are only a handful of people in the world who can design one from scratch. I asked my friend: “What did this guy have to say ?”, the answer, rather simplistically put was that sorting was dead. Its a gross over simplification to suggest any database engine or data platform can do away with sorting, however the point is that sorting is one of the worst operations there is for confusing the CPU pre-fetcher and preventing it from doing its job. Unfortunately there are a lot of things the database engine has to do which are not CPU pre-fetcher friendly, such as hashing and pointer chasing when seeking down an index b-tree.
- The Translation Look-aside Buffer (TLB)
This resides inside the on CPU-die memory management unit and stores part of the virtual memory page table. This is the core concept behind why large memory pages can lead to an increase in performance. In essence larger pages mean the the TLB can cover more memory and improved performance unless it is being thrashed. In the example below using a simple ring buffer and the in-memory engine and a memory optimised table with a hash index, the use of large pages results in no improvement after four threads because it is being thrashed.
Twice The CPU Resource Does Not Always Translate Into Twice The Throughput
Due to overheads associated with spin-locking and latching, doubling the CPU resource a workload can use with the legacy engine does not necessarily translate into double the throughput:
The Two Techniques For Relieving Synchronization Primitive Bottlenecks
When faced with the bottleneck of contention around a synchronisation primitive, such as a latch, there are two solutions. Consider a parallel scan of a heap, here we can see that there is contention on the ACCESS_METHODS_DATASET_PARENT latch:
Using the analogy of this latch protecting the part of the database that deals out page ranges for the next “Sub scan” being similar to a card dealer at a table, the resolution is to create more “Card dealers” by (hash) partitioning the heap:
The other alternative when faced with contention on a single synchronisation primitive in the form of a latch, is to replace the way the workload is processed by a method that avoids latching full stop. Take for example the implementation of a ring buffer from the previous blog post using the in-memory engine. The naive approach to determining which slot in the ring buffer to pop/push messages into/from is to use a sequence object, this leads to contention on the latch that synchronises access to that object:
The solution to this is to ‘Craft’ a lock and latch free sequence, this is done using a memory optimized table with a single identity column. A sequence value is obtained by inserting a row into the table, getting the identity value using SCOPE_IDENTITY() and then deleting the row, so that the table does not grow to be an in-ordinate size. The proof of the pudding is in the increase in performance that our non-blocking sequence objects results in:
I don’t want to give the impression that I intend to stop blogging, as this is absolutely not the case. However, I have bombarded my readers with endless graphs and CPU diagrams, in this post my hope was to distill down the key tenets that underpin the principles of getting a SQL Server (or any data platform for that matter) to process workloads in a scale-able manner.