In this series of blog posts, I’d like to elaborate on the material I intend to present at SQLbits by taking my slide deck and constructing a commentary around it. First I would like to touch on xperf, as I will use this to provide some unique insights into the database engine. Xperf comes as part of the Windows Performance Toolkit, which in turn ( and my apologies if this sounds confusing ) comes as part of the Windows Assessment and Deployment Kit. For me, the most interesting aspect of xperf is its ability to stack walk the database engine.
Xperf can provide a 360 degree view of where your CPU time is going. But first of all, let me touch on performance tuning 101 via waits and queues. Because Windows is a general purpose operating system, it was never designed with running a database engine in mind; part of the re-write of the engine for SQL Server 2005 introduced SQLOS. SQLOS is the lowest layer of the database engine which, amongst many other things, performs thread and memory management. The reason for the existence of SQLOS is twofold: firstly, SQL Server understands its own thread scheduling requirements better than the operating system does, and secondly, layering software such that each layer encapsulates common functionality ( the “OSI seven layer model” being a prime example of this ) leads to software that is cleaner, easier to maintain and more robust.
SQLOS utilises CPU resources via schedulers; broadly speaking, there is usually one scheduler per logical processor that the underlying CPU architecture exposes. Schedulers abstract away the nuances of the hardware from SQL Server, firstly by dealing with NUMA and hyper-threading, and secondly by allowing SQL Server threads to be scheduled and processed without any context switches being incurred, unless the operating system itself pre-empts SQL Server ( refer to wait events prefixed by PREEMPTIVE_OS_ ). Each scheduler has a runnable queue and a waiting list; when a thread is at the head of the runnable queue it can burn CPU cycles up to a quantum of 4 ms. Whenever a thread voluntarily yields the scheduler, be it through its 4 ms time quantum being used up, the thread spinning on a spin lock or the thread’s CPU consumption being throttled by resource governor, time is accrued against the SOS_SCHEDULER_YIELD wait event.
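To make the cooperative scheduling idea concrete, here is a toy Python model ( my own illustration and nothing to do with the actual SQLOS implementation ): each “thread” burns up to one 4 ms quantum, then voluntarily yields and rejoins the tail of the runnable queue, accruing a SOS_SCHEDULER_YIELD-style count as it does so. The workload names and millisecond figures are made up for the example.

```python
from collections import deque

QUANTUM_MS = 4  # a thread may burn up to 4 ms of CPU before it must yield

def run_scheduler(workloads):
    """Toy model of one SQLOS scheduler: threads cooperate rather than
    being pre-empted, so a thread needing more than one quantum yields
    voluntarily and goes to the back of the runnable queue."""
    runnable = deque(workloads.items())          # (name, remaining ms of CPU work)
    yields = {name: 0 for name in workloads}
    while runnable:
        name, remaining_ms = runnable.popleft()  # head of the runnable queue runs
        remaining_ms -= min(QUANTUM_MS, remaining_ms)
        if remaining_ms > 0:                     # quantum exhausted: voluntary yield
            yields[name] += 1
            runnable.append((name, remaining_ms))
    return yields

# A 10 ms task needs three quanta ( 4 + 4 + 2 ms ), i.e. two voluntary yields
print(run_scheduler({"spid51": 10, "spid52": 4}))
```

A real scheduler also moves threads between the waiting list and the runnable queue as resources become available; this sketch only shows the quantum / yield mechanic.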
This is where xperf comes into its own: it is possible to infer where your CPU time is going, but stack walking provides the most definitive means of pinpointing where your CPU cycles are being expended. Also, with the advent of in-memory database technologies and low latency, high IOPS and bandwidth storage, the performance spotlight is going to fall more and more on where your CPU cycles are going.
From SQL Server 2012 onwards the database engine is divided into a set of DLLs, each of which deals with a specific area of functionality; these are:
| Functionality | DLL |
| --- | --- |
| Language processing | SQLLANG.dll |
| Run time and storage engine | SQLMIN.dll |
| Query data store ( new to SQL 2014 ) | QDS.dll |
| Thread scheduling, processor hardware abstraction and memory management | SQLDK.dll |
I’ll dive into walking an actual stack in my next blog post; however, it is not hard to see that if we know what the DLL composition of SQL Server is, taking SQLLANG.dll as an example, we can do things such as quantify the cost of language processing. The image below, taken from one of my slide decks, provides a “mind map” of how a stack trace of the database engine is obtained.
On to the specifics of batch mode and column store indexes. Most people will be familiar with Moore’s law as illustrated in the slide below.
The upshot of this is that there has been a disparity between storage and CPU technology, in terms of storage sub-systems being able to keep CPU cores busy. Compression has been seen as the answer to this: we make tables and indexes smaller, resulting in fewer pages to transfer from disk to memory, and whilst our CPUs would otherwise be idling, the spare CPU cycles can be used to decompress the data.
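The trade-off can be sketched in a few lines of Python; zlib here is only a stand-in for SQL Server’s own row / page / column store compression, and the sample “page” contents are invented for the example. We pay CPU cycles at read time ( the decompress call ) in exchange for fewer bytes pulled from the storage sub-system.

```python
import zlib

# Hypothetical page of repetitive row data: relational data often
# compresses well because of repeated values within a column
page = b"AAAA-BBBB-CCCC-" * 512                  # 7680 bytes uncompressed

compressed = zlib.compress(page)                 # fewer bytes to read from disk
assert len(compressed) < len(page)

# The IO saving is paid for in CPU cycles when the data is read back
restored = zlib.decompress(compressed)
assert restored == page

print(len(page), "bytes raw ->", len(compressed), "bytes compressed")
```

Whether this is a win depends entirely on which side of the CPU / IO gap your workload sits, which is precisely the question the rest of this series digs into.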
The notion of the latency of accessing the storage sub-system being orders of magnitude higher than that of accessing main memory has influenced the design of SQL Server heavily; this is why the database engine can perform read-aheads and why there is the concept of “write behind” with the buffer cache and the lazy writer / checkpoints. In addition to this there are hard-coded assumptions built into the optimizer. For example, many of its base costs are derived from a Dell OptiPlex used by one of the SQL Server team’s developers ( something Thomas Kejser has referred to as “The machine under Lubor’s desk” in one of his posts ); it is safe to say that this machine used “spinning rust” style hard drives. The optimizer also assumes that the storage sub-system cannot service multiple IO operations in parallel, which is why IO costs are not taken into account when the optimizer determines whether to compile T-SQL statements with serial or parallel plans.
Focussing on the subject of latency in more depth, this next slide illustrates the latency penalties incurred when accessing different levels of the memory and storage sub system hierarchy.
The processors in my test rig ( Sandy Bridge core based Xeons ) can expend up to 160 CPU cycles when accessing main memory. The images that run across the bottom of the slide provide an analogy for the time it takes to access the CPU caches, main memory etc. It is said that CPU cache is the new flash, and flash the new memory. CPU designers put a great deal of effort into cache design and the algorithms behind pre-fetching data and instructions into the caches, in order to minimise the penalty of wasted CPU cycles when accessing main memory ( let alone disk ). What the slide does not illustrate is a recent development by Intel: the inclusion of a 128MB level 4 cache on the package of their Haswell core based processors ( code-named “Crystalwell” ).
As you might expect “Dropping out” of the different caches in the memory hierarchy impacts throughput:
In an ideal world, we want all memory access to be confined to the on-chip L1/2/3 ( and now 4 ) caches. However, even this is nuanced, in that pages accessed sequentially within the CPU cache yield superior pages served / second throughput compared to pages accessed randomly:
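You can get a feel for this effect even from Python, although the interpreter’s own overhead and object indirection mute it considerably compared to native code; this is a rough sketch of my own, not a rigorous benchmark. The same elements are summed twice, once in sequential order ( which the hardware pre-fetchers love ) and once in a shuffled order ( which defeats them and turns most accesses into cache misses ).

```python
import random
import time

N = 1_000_000
data = list(range(N))

seq_order = list(range(N))
rand_order = seq_order[:]
random.shuffle(rand_order)            # random access defeats hardware pre-fetching

def scan(order):
    start = time.perf_counter()
    total = sum(data[i] for i in order)
    return total, time.perf_counter() - start

seq_total, seq_secs = scan(seq_order)
rand_total, rand_secs = scan(rand_order)

assert seq_total == rand_total        # identical work, different access pattern
print(f"sequential: {seq_secs:.3f}s  random: {rand_secs:.3f}s")
```

On most machines the random scan is measurably slower despite doing exactly the same arithmetic, which is the point the slide makes for page access within the cache hierarchy.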
The final point to consider is how the execution engine runs queries. We all know that cursors and row by agonising row processing are bad for performance; unfortunately, it just so happens that the execution engine behaves like one great big cursor. With the exception of parallel iterators, which have the concept of a producer side and a consumer side, in which data is pushed from the producer side ( logical right ) to the consumer side ( logical left ), rows are pulled through the plan one at a time, each iterator requesting the next row from its child on its logical right. Paul White ( @sqlkiwi ) has referred to this as the “On demand iterator pattern”:
This is bad for performance and scalability in terms of the number of machine instructions per row that have to be executed and the random CPU cache page accesses this causes.
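This per-row overhead is exactly what batch mode attacks. As a toy illustration ( my own Python sketch, not the engine’s implementation ), compare an iterator that hands over one row per call with one that hands over a batch per call; SQL Server’s batch mode uses batches of up to roughly 900 rows, which is the figure borrowed here.

```python
# Contrast of row-mode vs batch-mode iterators: same rows flow through,
# but the batch iterator amortises the per-call overhead across a batch.

calls = {"row": 0, "batch": 0}

def row_iterator(rows):
    for row in rows:
        calls["row"] += 1          # one GetNext-style call per row
        yield row

def batch_iterator(rows, batch_size=900):
    for i in range(0, len(rows), batch_size):
        calls["batch"] += 1        # one call hands over a whole batch of rows
        yield rows[i:i + batch_size]

rows = list(range(9000))
row_sum = sum(row_iterator(rows))
batch_sum = sum(x for batch in batch_iterator(rows) for x in batch)

assert row_sum == batch_sum        # identical results, very different call counts
print(calls)                       # row mode: 9000 calls, batch mode: 10
```

Fewer calls per row means fewer machine instructions of pure iterator plumbing and better instruction cache behaviour, which is where the batch mode gains explored later in this series come from.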
In summary, this initial blog post has covered:
- An overview of xperf and why I like to use it.
- The hardware “baggage” that has plagued performance and scalability: the gap between the rate at which CPUs can consume data and the throughput the storage sub-system can deliver.
- The performance and latency penalties we pay as we drop down the CPU cache / memory hierarchy.
- The inherent inefficiency in query execution.
Because compression is used specifically to bridge the CPU consumption / storage sub-system throughput gap, in my next post I will cover how sequential scans behave on slow, spinning disk based storage versus fast, flash based storage, and quantify the CPU cycle cost of decompressing compressed objects and of batch versus row mode iterators. Along the way I will also cover some column store fundamentals and what SQL Server 2014 brings that is new to the “column store / batch processing” party.
Lastly, many thanks to Thomas Kejser for kindly letting me borrow material from his one day SQL Server Master Tuning session.