The batch mode engine as it stands has delivered an order of magnitude performance increase over the row mode engine. In this post ( and please bear with me, as this is genuine level 400 stuff ), I will outline which part of the CPU is coming under pressure and what should be leveraged to achieve the next step change in performance.
I will start with something people might be familiar with; the path of data from physical disk to server. One of the aims of the fast track methodology is to come up with hardware architecture that guarantee a sustained throughput of data from physical disk to CPU core that helps to avoid “CPU starvation”:
Each hardware component in the chain above has to be specified such that we get the desired and sustained throughput from one end of the ‘Pipe’ to the other. The same principle applies to the world of in-memory:
I gave a very high level overview of what the front and back ends do in the last post, here is a more detailed one:
I mentioned in the previous post that micro operations get processed in pipelines that pass through the front and back ends of the processor. Intel VTune Amplifier XE provides many processor key performance indicators, this post will focus on three:
- Clock cycles per instruction
Modern Xeon processors can issue four instructions per clock cycle, ( four because there are four pipelines ) the theoretical best for this KPI is 0.25.
- Front end Bound
The CPU is “Front end bound” when it is delivering less than 4 uops per cycle and the back end is ready to accept uops, this figure should ideally be as small as possible.
- Back end Bound
“Back end” bound behavior takes place when the rate of uops retired is low due to pressure on resources at the back end of the pipeline, ‘Resources’ include things such as execution units, long latency operations can also cause pressure at the back end.
This information comes from the “2nd CERN Advanced Tuning Workshop”. Because I dislike metrics being used without any formal explanation of how they are derived, here are the formulas these KPIs:
- Clock Cycles Per (Retired) Instruction
- Clock ticks / UOPS.RETIRED.RETIRED_SLOTS
- Front end bound ( smaller is better )
- IDQ_NOT_DELIVERED.CORE / (4 * Clock ticks)
- Back end bound ( ideally, should = 1 – Retiring)
- 1 – (Front end bound + Bad speculation + Retiring)
I don’t expect many people to instantaneously grasp how these formulas work, however, at the same time, I don’t want anyone to think that I have plucked these out of thin air. Returning to the following query and a warm column store object pool:
SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [dbo].[FactInternetSalesBigSorted] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY [CalendarQuarter] OPTION (MAXDOP 24);
we can see how these three KPIs vary as we increase the degree of parallelism:
To take a step back, performance tuning is like playing whack-a-mole, you get rid of a bottleneck in one particular place only for another to appear elsewhere. The whole column store / batch processing architecture has resulted in a SQL run time engine which makes highly efficient use of the front end through leveraging the CPU caches, but in doing so it has now shifted the load and pressure to the back end.
The CPU back end has a number of execution units, pressure on the back end can manifest as “Port saturation”. Back end execution units are accessed via ports, “Port saturation” refers to excessive demands made on a port or ports. Throughout this series I have made reference to the Sandybridge micro architecture, this is because its probably the most prevalent one in production, Ivybridge is a die shrink of Sandybridge, therefore it is fundamentally the same apart from the geometry it is fabricated in and Haswell ( as far as I know ) is not yet slated for release in Xeon form until the end of this year. This is what the Sandybridge scheduler looks like:
The Intel profiling tool can monitor scheduler port activity, if 70% or more of the CPU cycles are spent on requests for a specific port, Intel deem this as being a cause for concern.
Considering that the peak CPU utilization on the test server was 59%, if we can uncover whatever bottleneck is throttling the CPU consumption back, its possible that the figures in the graph above will move into the realms of port saturation.
How is this alleviated ?, the generic term for processing data in sets is known as
vector-ised processing, the batch engine is essentially a vector processing engine. The problem is that the vector-ised processing is taking place in terms of the way the CPU caches are used, but not in the way that the back end of the CPU is utilised. In CPU terms, Back end vector-isation is utilised via ‘SIMD’, “Single instruction multiple data”, these are machine instructions that allow a single instruction to execute against data held in multiple registers simultaneously:
Whilst running a query in batch mode, VTune Amplifier can be used to detect whether a CPU is using any SIMD type instructions, sadly this does not appear to be the case with the database engine in its current form:
SIMD instructions have been around in the Intel processor line up since the Pentium II in the form of multi media extensions (MMX). SIMD is represented in the current Intel road map by advanced vector extensions (AVX) which first appeared in the Sandybridge micro architecture, it caters for 256 bit wide vectors and the vector-isation of floating point numbers. The Haswell micro architecture introduces more flexibility through integer SIMD extensions and Skylake increase the size of the vector to 512 bytes.
Returning to the profiler, CBagAggregateExpression::TryAggregateUsingQE_Pure is the top consumer of CPU cycles:
By reducing the number of distinct OrderDateKey values per row group, the engine could conceivably scan down the row group segments aggregating four money values per instruction. The money data type is 8 bytes wide allowing the storage of 4 money in one 256 bit vector.
The concept of leveraging SIMD is not new to the database world or that of SQL Server. Joe Chang has blogged on the subject, Microsoft research have published a white paper on the subject, IBM has leveraged this technology in their BLU accelerator for DB2 and SAP HANA uses a fixed length encoding scheme specifically to cater for AVX as per the excerpt below I found in this article:
“We do use vector processing, and we are actually working on compressed data a majority of the time,” says Hallenbeck. “So HANA looks at what version of the processor is running, and in this case, AVX gave us a lot of new things we could do in the system to make it faster. We use fixed-length encoding in our compression scheme. By going from 128-bit to 256-bit, we can double the instruction rate per cycle, but it gives us an additional set of optimizations on top of that. Plus, we have a lot more addressable memory that is local to individual cores, and we have more cores.”
By aligning the granularity of compressed data to AVX register size, AVX can be utilised without having to split up or concatenate encoded run lengths.
In summary this is the next step that Microsoft need to take, the figures I have been able to pull out of VTune Amplifier XE support this. I am not privy to any Microsoft SQL Server road maps, however I strongly suspect there will be a SQL Server 2016 and I’m wondering if its database engine will be able to leverage SIMD, I really hope that it does.