In this post I want to explore the nuances of memory with an interesting little test, the results of which are counterintuitive to what most people might expect ;-). Along the way I will dig a bit deeper into CPU architectures and touch on hyper-threading and thread scheduling.
/*
 * Create 1095500000 rows
 */
WITH generator AS
(
    SELECT TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM   (SELECT a = 1
            FROM   master.dbo.syscolumns) c1
           CROSS JOIN master.dbo.syscolumns c2
)
SELECT  a.DateKey AS OrderDateKey
       ,CAST(((id - 1) % 1048576) AS money) AS Price1
       ,CAST(((id - 1) % 1048576) AS money) AS Price2
       ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO    FactInternetSalesBig
FROM    generator
        CROSS JOIN dbo.DimDate a;

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBig;
GO

SELECT  [CalendarQuarter]
       ,SUM([Price1])
       ,SUM([Price2])
       ,SUM([Price3])
FROM    [dbo].[FactInternetSalesBig] f
JOIN    [DimDate] d
ON      f.OrderDateKey = d.DateKey
GROUP   BY [CalendarQuarter]
OPTION  (MAXDOP 24);
GO
The SELECT statement at the end of the excerpt above completes in 11,426 ms from a warm large object cache. My test setup comprises:
- Windows Server 2012
- SQL Server 2014 CU2
- 2 x 6-core Xeons clocked at 2 GHz (Sandy Bridge cores)
- 48 GB of 1,333 MHz non-ECC quad channel memory, evenly divided between the CPU sockets
- Hyper-threading enabled, unless specified otherwise
If we take a very similar scenario, with the exception that the data is pre-sorted on the OrderDateKey column
/*
 * Create 1095500000 rows
 */
WITH generator AS
(
    SELECT TOP 50000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM   (SELECT a = 1
            FROM   master.dbo.syscolumns) c1
           CROSS JOIN master.dbo.syscolumns c2
)
SELECT  a.DateKey AS OrderDateKey
       ,CAST(((id - 1) % 1048576) AS money) AS Price1
       ,CAST(((id - 1) % 1048576) AS money) AS Price2
       ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO    FactInternetSalesBigSorted
FROM    generator
        CROSS JOIN dbo.DimDate a;

CREATE CLUSTERED INDEX ccsi
ON FactInternetSalesBigSorted ( OrderDateKey );

/* A clustered columnstore index covers all columns, so no column list
   is specified; DROP_EXISTING = ON converts the rowstore clustered
   index, preserving the sort order of the data. */
CREATE CLUSTERED COLUMNSTORE INDEX ccsi
ON FactInternetSalesBigSorted
WITH (DROP_EXISTING = ON);
GO

SELECT  [CalendarQuarter]
       ,SUM([Price1])
       ,SUM([Price2])
       ,SUM([Price3])
FROM    [dbo].[FactInternetSalesBigSorted] f
JOIN    [DimDate] d
ON      f.OrderDateKey = d.DateKey
GROUP   BY [CalendarQuarter]
OPTION  (MAXDOP 24);
GO
the last query takes 6,060 ms to run. The surprising thing is that FactInternetSalesBig is 1,798 MB in size, whereas FactInternetSalesBigSorted is 8,555 MB in size.
To understand why we have obtained this 'unexpected' result, let's look at CPU architectures and CPU caches. Most modern processors have a three-level cache hierarchy; Xeon processors from Sandy Bridge onwards have a four-level cache hierarchy, with level 0 being used to store decoded instructions. Since my last blog post I've produced a nicer picture of what a modern CPU looks like:
I mentioned that memory is nuanced: the Sandy Bridge architecture introduces a bi-directional ring bus connecting the level 3 cache to the CPU cores, and the latency when accessing it depends on how many core-to-core hops the data needs to take. Below are the latencies in clock cycles when accessing different parts of the cache; note the latency when accessing main memory in particular:
The figures above relate to the Core i7 3960X Sandy Bridge-E processor.
The cost of main memory access gets worse with foreign/remote memory access under NUMA. According to Sysinternals coreinfo there is an extra 20% overhead when accessing foreign memory on my test server; in other words, what costs 167 CPU cycles for local memory access becomes 201 cycles for remote/foreign main memory access:
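To put those cycle counts into perspective, here is a quick back-of-the-envelope calculation converting them into wall-clock time at this server's 2 GHz clock speed; the cycle counts are the figures quoted above, the rest is simple arithmetic:

```python
# Convert the main memory latencies quoted above from CPU cycles into
# nanoseconds for the 2 GHz test server, and check the implied NUMA overhead.
CLOCK_GHZ = 2.0            # 2 GHz => 2 cycles per nanosecond
LOCAL_CYCLES = 167         # local main memory access
REMOTE_CYCLES = 201        # remote (foreign) main memory access

local_ns = LOCAL_CYCLES / CLOCK_GHZ          # 83.5 ns
remote_ns = REMOTE_CYCLES / CLOCK_GHZ        # 100.5 ns
overhead = REMOTE_CYCLES / LOCAL_CYCLES - 1  # ~0.20, i.e. the ~20% from coreinfo

print(f"local : {local_ns:.1f} ns, remote: {remote_ns:.1f} ns, "
      f"overhead: {overhead:.1%}")
```

At around a hundred nanoseconds per remote access, a query touching millions of cache lines pays a very real price for crossing sockets.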
When a CPU core requires data that cannot be found in the CPU cache hierarchy, a "CPU stall", or last level cache miss (often abbreviated to LLC miss), takes place. SQLOS takes CPU caches into consideration when scheduling threads:
Whilst we want to minimise CPU stalls, the stall times are precisely what makes hyper-threading viable from the Nehalem micro-architecture onwards. The Nehalem(+) architecture schedules a second thread per core using the CPU resources that become available when the main thread experiences a CPU stall; the CPU cycles that elapse during a stall are otherwise "dead". This is particularly useful for OLTP workloads, whose random IO nature makes CPU stalls prevalent:
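The intuition can be captured in a toy model (my own simplification for illustration, not a published Intel formula): if a single thread spends some fraction of its time stalled on memory, a second hardware thread can soak up those otherwise dead cycles, up to the point where the core is saturated:

```python
def core_utilisation(stall_fraction: float, hyper_threading: bool) -> float:
    """Toy model of how busy one core's execution resources are.

    A single thread keeps the core busy for (1 - stall_fraction) of the
    time; the cycles spent stalled are otherwise dead. With
    hyper-threading a second thread can run during the first thread's
    stalls, so together the two threads can use up to 100% of the core.
    """
    busy = 1.0 - stall_fraction
    if hyper_threading:
        busy = min(1.0, 2 * busy)       # second thread fills the stalls
    return busy

# An OLTP-style workload that stalls on memory 40% of the time:
print(core_utilisation(0.4, hyper_threading=False))  # 0.6
print(core_utilisation(0.4, hyper_threading=True))   # 1.0
```

The model also shows why hyper-threading helps less for stall-free workloads: with nothing stalling, the second thread has no dead cycles to exploit.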
With hyper-threading enabled and a warm large object cache we can observe CPU utilisation as the degree of parallelism is increased for the SELECT statement using the ‘Sorted’ column store:
From this we can see that every odd degree of parallelism results in 60% core usage, and that every increment of the DOP by two consumes 5% of the total CPU capacity. With hyper-threading enabled it would make sense for SQLOS to schedule threads on a one-thread-per-core basis where possible; a simple test with a DOP of six reveals that this is exactly the case:
The picture above shows that all six threads are running on NUMA node 1: because Windows schedules threads to run on node 0 during boot, SQLOS swaps the nodes around and treats node 1 as its "node 0". The CSS Engineers blog refers to this, as does Bob Ward during his SQLOS session at PASS last year.
With OLAP workloads CPU stalls should be minimised, but how do we achieve this? All modern Xeon processors are pipelined, and they speak two different languages: executable images are made up of x86/x64 IA (Intel Architecture) instructions, which are "complex", whereas what the CPU actually executes are micro-operations, abbreviated to µops. This allows the CPU to leverage RISC-like tricks from more modern CPU designs; 99.9% of all smartphones run some form of ARM CPU derivative, which uses RISC (reduced instruction set computing) instructions, while complex instruction sets (CISC) hark back to the days of mainframes and mini-computers. One of the key advantages of micro-operations is that they lend themselves better to instruction pipelining than complex instructions do. Modern Xeon processors have four of these pipelines:
The "pre-fetcher" keeps the front end busy by reading ahead from main memory into the CPU cache hierarchy. It achieves this by monitoring memory access patterns and "predicting" what to retrieve from main memory next. If you look at the advanced settings in your BIOS you will find the pre-fetcher settings somewhere.
The pipelines traverse the front end and back end of the CPU; this is what they do at a very high level:
The life of the pre-fetcher is made easier if data structures are accessed in a predictable manner that lends itself to sequential access; more simply put, you make the memory access patterns the pre-fetcher is monitoring as predictable as possible. Hash tables are particularly bad for sequential access, which is a bit of a problem considering that hash joins are the critical join mechanism for OLAP-style query workloads (there is a clue in here which relates to the conundrum at the top of the post 😉). To investigate what is happening inside the CPU in terms of pressure on the front and back ends and cache utilisation, we need something that goes deeper than the standard tools furnished by Microsoft. Certain versions of Visual Studio allow CPU stall activity to be captured; however, we want more than just CPU stalls, and where better to look for a tool than the people who make the processor: introducing Intel VTune Amplifier XE:
Returning to the problem at hand: our simple query aggregates three money columns by the CalendarQuarter column. This is how the queries against the sorted and non-sorted column stores scale as the degree of parallelism is increased:
The 'sorted' column store index query achieves an elapsed time at a degree of parallelism of four that is not far off what the 'non-sorted' column store achieves at a degree of parallelism of 24. Below is the CPU utilisation per degree of parallelism for each column store type:
To prove that there is no IO at play, here is the wait activity captured for the query using the ‘Sorted’ column store at a degree of parallelism of 24:
It struck me as odd that the signal wait time is identical to the total wait time. I ran this past Thomas Kejser, whose opinion was that for a short-lived wait on something like an uncontested spin lock this type of wait activity is to be expected.
Using the Intel profiling tool we can actually see where our CPU cycles are going; this is with a full DOP of 24 and a warm column store object pool:
In summary, pre-sorting the data prior to column store index creation results in the number of CPU cycles used by CpagHashTable::SimpleLookup dropping so dramatically that I had to use the find utility in the profiler to locate it.
For completeness, below are the actual execution plans for the two SELECT statements with a DOP of 24; the execution plans in tree form contain the estimated costs:
Column store created on pre-sorted data
Column store created on non-pre-sorted data
Let's now take a look at CPU stall activity:
The total memory grant for the execution plan is 174152; however, if we use the following query to look at the size per column and the average segment size for each column store, it reveals some interesting information that is hidden by the total sizes of the two column store indexes:
SELECT  o.name AS [Table name]
       ,cmns.name AS [Column Name]
       ,SUM(s.on_disk_size) / (1024 * 1024) AS [Size (MB)]
       ,(SUM(s.on_disk_size) / COUNT(DISTINCT s.segment_id)) / 1024 AS [Avg Segment Size (KB)]
FROM    sys.column_store_row_groups c
JOIN    sys.objects o
ON      o.object_id = c.object_id
JOIN    sys.partitions p
ON      p.object_id = o.object_id
JOIN    sys.column_store_segments s
ON      s.hobt_id = p.hobt_id
JOIN    sys.columns cmns
ON      cmns.object_id = o.object_id
AND     cmns.column_id = s.column_id
WHERE   o.name IN ( 'FactInternetSalesBig'
                   ,'FactInternetSalesBigSorted' )
GROUP   BY o.name
       ,cmns.name
ORDER   BY [Table name]
       ,[Column Name];
The hash table associated with the hash aggregate should fit into the CPU cache somewhere. Pre-sorting the data on the OrderDateKey column helps to achieve several things for the FactInternetSalesBigSorted clustered column store index:
- The segments associated with OrderDateKey column fit inside the L2/3 cache.
- Minimising the number of distinct OrderDateKey values per segment means that when each thread scans an OrderDateKey column segment, the resulting memory access activity is as "pre-fetcher friendly" as possible.
- What was random memory access when probing the hash aggregate hash table now becomes sequential memory access.
These factors result in CPU cycle savings that outweigh the extra work involved in scanning in the segments for the Price1, Price2 and Price3 columns.
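The hash-probe point can be illustrated with a small simulation; this is a sketch of the principle only, not of SQL Server's actual hash aggregate internals. When the rows arrive sorted on the grouping key, consecutive probes of the hash table land on the same entry over and over, whereas with unsorted input each probe jumps to a different bucket:

```python
import random
from collections import Counter

random.seed(42)                       # reproducible illustration

def hash_aggregate(rows):
    """Minimal hash aggregate: SUM(price) GROUP BY key."""
    totals = Counter()
    for key, price in rows:
        totals[key] += price          # one hash table probe per row
    return totals

def consecutive_hit_fraction(keys):
    """Fraction of probes landing on the same entry as the previous
    probe - a crude proxy for cache-friendly, sequential probing."""
    repeats = sum(1 for prev, cur in zip(keys, keys[1:]) if prev == cur)
    return repeats / (len(keys) - 1)

n_rows, n_keys = 100_000, 40          # 40 distinct grouping values
sorted_keys = sorted(random.randrange(n_keys) for _ in range(n_rows))
shuffled_keys = sorted_keys[:]
random.shuffle(shuffled_keys)

# Same aggregate result either way; only the memory access pattern differs.
assert hash_aggregate((k, 1) for k in sorted_keys) == \
       hash_aggregate((k, 1) for k in shuffled_keys)

print(consecutive_hit_fraction(sorted_keys))    # ~0.9996: probes barely move
print(consecutive_hit_fraction(shuffled_keys))  # ~0.025: probes jump around
```

With sorted input almost every probe touches memory that is already in a register or the L1 cache; with shuffled input nearly every probe is a fresh, unpredictable memory access.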
Granted, this scenario is highly contrived, but I constructed it to illustrate the point that not all memory is equal: how we access memory has significant performance ramifications. In more practical terms, this example suggests that, where possible, data should be pre-sorted on the column(s) containing the values subject to the heaviest hash probe activity, be that in the hash table of a hash aggregate or of a hash join.
I have explained that the CPU architecture has a front end and a back end; in my next post I will look at where the bottleneck falls on these different parts of the processor.