In this post I will demonstrate how a neat trick brought to my attention by Niko Neugebauer can turn the processing of window functions from “anti-scale” to scalable. First of all we need to create some test data:
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
INTO   TestData
FROM   sys.all_objects a
CROSS JOIN sys.all_objects b
CROSS JOIN (SELECT TOP 200 * FROM sys.all_objects) c
WHERE  a.type = 'P'
AND    b.type = 'P'
UNION ALL
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM   sys.all_objects a
CROSS JOIN sys.all_objects b
WHERE  a.type = 'U'
AND    b.type = 'U'
UNION ALL
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM   sys.all_objects a
CROSS JOIN sys.all_objects b
WHERE  a.type = 'V'
AND    b.type = 'V'

CREATE CLUSTERED INDEX csx ON TestData (object_id, code)
Secondly, a query which includes a window function is required:
SELECT t.object_id
      ,code
      ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM   TestData t
This is the resulting execution plan for the query:
This graph depicts the elapsed times that we get as the degree of parallelism is increased:
To cut to the chase: we get “anti-scale”, in that elapsed time gets worse as the degree of parallelism increases. SQL Server 2016, however, has a batch mode window aggregate iterator.
A Recap On Batch Execution Mode
The SQL Server execution engine fundamentally acts like a cursor: control flow is exerted from the root node of the plan down to the right-most child iterators, and the (logical) flow of data through the plan is in the opposite direction. I say ‘logical’ because in practice the runtime uses buffers in order to minimise data movement whilst executing the plan. However, up until SQL Server 2012, which first introduced batch mode, execution plans were executed by iterators processing data in a row-by-agonising-row manner. Batch mode changes all of this: if we can ferry rows around in batches, this reduces the number of CPU cycles it takes to process an iterator in the plan (providing it supports batch mode). Also, by sizing batches such that they fit inside the level 2 cache of the CPU, we gain even more performance by minimising misses against the CPU’s last level cache or, worse still, trips out to main memory.
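As an aside, you can verify which execution mode each iterator actually used by capturing the actual execution plan; the sketch below is my own, not from the original post, but the `ActualExecutionMode` attribute is what the showplan XML exposes:

```sql
-- Capture the actual plan for the statement that follows
SET STATISTICS XML ON;

SELECT t.object_id
      ,code
      ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM   TestData t;

SET STATISTICS XML OFF;
-- In the returned showplan XML, each iterator's RunTimeInformation
-- carries ActualExecutionMode="Row" or ActualExecutionMode="Batch".
```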
Leveraging Batch Mode With Row Stores
There is a neat trick that Niko Neugebauer has made me aware of: create a table, leave it empty, create a columnstore index on it, and then left outer join it into your original query. Using this technique, our query now looks something like this:
CREATE TABLE [dbo].[DummyTable]
(
    [object_id] [int] NULL
)

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON [dbo].[DummyTable]

SELECT t.object_id
      ,code
      ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM   TestData t
LEFT JOIN DummyTable d
ON     d.object_id = t.object_id
OPTION (HASH JOIN)
This is the resulting graph:
Crucially, we have gone from anti-scale to a query that now scales. By digging into the execution plan we can see that we are now getting batch mode iterators:
What About SQL Server 2017 ?
I have been led to believe by reliable sources that batch mode has been greatly enhanced in SQL Server 2017, and this is something that I intend to dig into. I have already demonstrated that if a columnstore is created on data that is pre-sorted on the join column, for smaller tables this results in CPU last level cache misses dropping through the floor. There is a wealth of research papers on hash join strategies designed to get around last level cache misses, such as hash joins with software prefetching, and I wonder if this is what Microsoft have used.
Creating A Columnstore On The TestData Table
Let’s repeat the same test again, but this time with a columnstore on the TestData table:
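The exact index definition is not reproduced in the post; a minimal sketch, assuming a nonclustered columnstore is layered on top of the existing clustered B-tree index (the index name `ncsi` is my own):

```sql
-- Hypothetical definition: the original post does not show the DDL.
-- TestData already has a clustered index (csx), so a nonclustered
-- columnstore over the two columns of interest is added alongside it.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncsi
ON TestData (object_id, code)
```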
The bottom line here is that the addition of the column store does not make any difference.
Takeaways
To summarise this post, we have seen how the ‘Neugebauer’ technique can be leveraged in order to make queries against row store objects execute in batch mode. In this specific case it has improved the execution plan through:
- The introduction of batch mode
- The introduction of batch mode window aggregate functions
- The elimination of a re-partition streams iterator
The final point is that the use of a columnstore on the TestData table itself made little or no difference to the performance of the query, implying that it is batch mode and the removal of the re-partition streams iterator that are doing the heavy lifting.
Thanks for the very interesting information.
I think there is a minor mistake in the graph “Elapsed time (ms) / Degree of Parallelism”. According to the captions, row mode scales better than batch mode.
You are correct, I will fix the captions.
Fixed, many thanks for pointing this error out.