Scalable Window Aggregate Functions With Row Store Objects

In this post I will demonstrate how a neat trick brought to my attention by Niko Neugebauer can turn the processing of windowing functions from “anti-scale” to scalable. First of all, we need to create some test data:

SELECT      a.object_id
           ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
INTO        TestData
FROM        sys.all_objects a
CROSS JOIN  sys.all_objects b
CROSS JOIN (SELECT TOP 200 *
            FROM   sys.all_objects) c
WHERE       a.type = 'P'
AND         b.type = 'P'
UNION ALL
SELECT      a.object_id
           ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM        sys.all_objects a
CROSS JOIN  sys.all_objects b
WHERE       a.type = 'U'
AND         b.type = 'U'
UNION ALL
SELECT      a.object_id
           ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM        sys.all_objects a
CROSS JOIN  sys.all_objects b
WHERE       a.type = 'V'
AND         b.type = 'V'

CREATE CLUSTERED INDEX csx ON TestData (object_id, code)
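As an optional sanity check (this helper query is mine, not part of the original post), it is worth confirming that the test data contains a range of partition sizes, since object_id is the column the window function below will partition on:

-- Hypothetical sanity check, not in the original post: row count per
-- object_id, i.e. the size of each window partition RANK() will process.
SELECT   object_id
        ,COUNT(*) AS rows_per_partition
FROM     TestData
GROUP BY object_id
ORDER BY rows_per_partition DESC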

Secondly, a query that includes a windowing function is required:

SELECT  t.object_id
       ,code
       ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM    TestData t

This is the resulting execution plan for the query:

[Figure: execution plan for the window function query, row mode]

This graph depicts the elapsed times that we get as the degree of parallelism is increased:

[Figure: elapsed time (ms) against degree of parallelism for the row mode plan]
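The post does not show the harness used to capture these timings; a plausible sketch (the MAXDOP values are illustrative) is simply to re-run the query with an explicit degree of parallelism hint and elapsed time reporting switched on:

-- Hypothetical test harness (assumption: not shown in the original post).
-- Re-run the query at each degree of parallelism and note the elapsed time
-- reported in the Messages tab.
SET STATISTICS TIME ON

SELECT  t.object_id
       ,code
       ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM    TestData t
OPTION (MAXDOP 8)  -- repeat for MAXDOP 2, 4, 8, 16 ...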

To cut to the chase, we get “anti-scale”. SQL Server 2016, however, has a batch mode window aggregate iterator, and this is what we are going to leverage.

A Recap On Batch Execution Mode

The SQL Server execution engine fundamentally acts like a cursor: control flow is exerted from the root node of the plan down to the right-most child iterators, and the (logical) flow of data through the plan is in the opposite direction. I say ‘logical’ because in practice the runtime uses buffers in order to minimise data movement whilst executing the plan. Up until SQL Server 2012, which first introduced batch mode, execution plans were executed by iterators processing data in a row-by-agonising-row manner. Batch mode changes all of this: if we can ferry rows around in batches, we reduce the number of CPU cycles it takes to process an iterator in the plan (provided it supports batch mode). Also, by sizing batches such that they fit inside the level 2 cache of the CPU, we gain even more performance by minimising CPU last level cache misses and, worse still, trips out to main memory.

Leveraging Batch Mode With Row Stores

There is a neat trick that Niko Neugebauer made me aware of: create a table, leave it empty, create a columnstore index on it, and then left outer join it into your original query. Using this technique, our query now looks something like this:

CREATE TABLE [dbo].[DummyTable] (
    [object_id] [int] NULL
)
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON [dbo].[DummyTable]

SELECT     t.object_id
          ,code
          ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM      TestData t
LEFT JOIN DummyTable d
ON        d.object_id = t.object_id
OPTION (HASH JOIN)

This is the resulting graph:

[Figure: elapsed time (ms) against degree of parallelism with the empty columnstore table joined in]

Crucially, we have gone from anti-scale to a query that is now scalable. Digging into the execution plan, we can see that we are now getting batch mode iterators:

[Figure: execution plan showing batch mode iterators]
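If you would rather not eyeball the graphical plan, a hedged alternative (standard showplan XML and plan cache DMVs, nothing specific to this post) is to search the cached plan for operators whose estimated execution mode is batch:

-- Sketch: find cached plans containing at least one operator with
-- EstimatedExecutionMode="Batch" in the showplan XML.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT      st.text
           ,qp.query_plan
FROM        sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE       qp.query_plan.exist('//RelOp[@EstimatedExecutionMode="Batch"]') = 1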

What About SQL Server 2017?

I have been led to believe by reliable sources that batch mode has been greatly enhanced in SQL Server 2017, and this is something that I intend to dig into. I have already demonstrated that if a columnstore is created on data that is pre-sorted on the join column, then for smaller tables this results in CPU last level cache misses dropping through the floor. There is a wealth of research on hash join strategies designed to get around last level cache misses, such as hash joins with software prefetchers, and I wonder if this is what Microsoft have used.

Creating A Columnstore On The TestData Table

Let's repeat the same test again, but this time with a columnstore on the TestData table.
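The post does not show the exact DDL used for this step; a minimal sketch, assuming a nonclustered columnstore added alongside the existing clustered index csx, might look like this:

-- Assumption: the original DDL is not shown. One plausible option is a
-- nonclustered columnstore over both columns, on top of the existing
-- clustered index csx.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncsi ON TestData (object_id, code)

This is the resulting graph: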

[Figure: elapsed time (ms) against degree of parallelism with a columnstore on TestData]

The bottom line here is that the addition of the columnstore does not make any appreciable difference to elapsed time.

Takeaways

To summarise this post, we have seen how the ‘Neugebauer’ technique can be leveraged in order to make queries that use row store objects execute in batch mode. In this specific case it has improved the execution plan by:

  • The introduction of batch mode
  • The introduction of batch mode window aggregate functions
  • The elimination of a re-partition streams iterator

The final point is that the use of a columnstore on the base table has made little or no difference to the performance of the query, implying that it is batch mode and the removal of the re-partition streams iterator that are doing the heavy lifting.

4 thoughts on “Scalable Window Aggregate Functions With Row Store Objects”

  1. Thanks for the very interesting information.
    I think there is a minor mistake in the graph “Elapsed time (ms) / Degree of Parallelism”: according to the captions, row mode scales better than batch mode.
