Anti Scale Patterns and The Repartition Streams Iterator In Row Mode

There is a type of behavior in the database engine which undermines scalability, this is when multiple threads contend for a single resource, contention on the page free space bit map is the example that most people will be familiar with. This image illustrates this anti scale pattern:

anti scale

This pattern inhibits scalability due to shared state, access to which requires synchronization via a latch:


or a spinlock:

Exchange iteratorThere are two potential solutions to this problem:

  • Create more of the resource being contended on
    In the case of the page free space bit map we create more data files.
  • Avoid using the construct that contains the “Anti scale pattern”

In the case of the re-partition streams iterator, a CROSS APPLY (an Adam Machanic technique) can be used to coerce the optimizer into using a nested loop join instead of the re-partition stream iterator. To test this we will require the AdventureWorks sample database and Adam Machanic’s make big adventure script, once we have our AdventureWorks database, bigProduct and bigTransactionHistory tables, a query is required that results in an execution plan containing the re-partition streams iterator:

SELECT  p.ProductId
       ,RANK() OVER (
            PARTITION BY p.ProductId
            ORDER     BY th.ActualCost DESC
        ) AS LineTotalRank
       ,RANK() OVER (
            PARTITION BY p.ProductId
            ORDER     BY th.Quantity DESC
        ) AS OrderQtyRank
FROM   bigProduct AS p
JOIN   bigTransactionHistory AS th
ON     th.ProductId = p.ProductId
WHERE  p.ProductId BETWEEN 1001 AND 20001

and here two of them are in the right most part of the plan:

repartition streams

According to the tool tip both are using hash as the mechanism to redistribute rows between threads:

tool tip

rewriting the query with a cross apply results in the re-partition streams iterators being replaced by a nested loop join:

SELECT       p.ProductId
FROM        (SELECT  TOP (2147483647) p0.ProductId
             FROM   bigProduct AS p0
             WHERE  p0.ProductId BETWEEN 1001 AND 20001
            ) AS p
CROSS APPLY (SELECT  th.TransactionId
                    ,RANK() OVER (
                         PARTITION BY p.ProductId
                         ORDER     BY th.ActualCost DESC
                     ) AS LineTotalRank
                    ,RANK() OVER (
                         PARTITION BY p.ProductId
                         ORDER     BY th.Quantity DESC
                     ) AS OrderQtyRank
FROM        bigTransactionHistory AS th
WHERE       th.ProductId = p.ProductId) AS d 


How do these two queries scale ?, firstly with a cold buffer cache:

graph cbc

and with a warm buffer cache:

graph wbc

At a degree of parallelism of 6 and a warm buffer cache the query using the cross apply is over twice as fast as the original query.

The theory is that replacing the re-partition streams exchange iterator with a nested loop join leads to better query scalability, this is the part of the stack associated with this iterator:exchange stackNote that the stack displays the spin lock ( X_PACKET_LIST ) which is used to synchronize access to the CX packet list structure, however the CPU time consumed by this is a fraction of that used by the database engine at the time the trace was captured; 82,546 ms for the entire database engine and 100 for the re-partition streams iterator.

Why Is The Re-Written Query More Efficient ?

The top ten most CPU intensive function for the row mode database engine run time are:

stack original

and this is the picture for the re-written query with the CROSS APPLY:

stack rewrite

Locating sortcmp in the call stack and then reversing it reveals all the functions that called sqlmin!sortcmp, unsurprisingly this has been called by the sort iterator:

reversed stack

Lets now drill down into the sorts in the two queries execution plans, firstly in the plan for the original query:

original plan actual costs

the sort that accounts for 42.5% of the plan cost has sorted 11,870,752 rows with an estimated row size of 89 bytes, totaling 1,008 MB of data, the next sort along accounts for 42.6% of the queries total cost and has a spill warning and has sorted the same number of rows with exact same average row size.

large sort

In the plan for the re-written query:

rewrite plan actual costs

the sort that immediately follows the index seek has sorted 11,870,752 rows, each with an estimated row size of 23 bytes, a total of 260 MB of data in total, the next sort along has sorted the exact same number of rows with the exact same estimated row size, totaling 260 MB in size also:

small sort


The CROSS APPLY is a T-SQL language construct of immense power, in this example by forcing the sort to take place on the rows retrieved from bigTransactionHistory before the join to the bigProduct table, two sorts of 1,008 MB have been turned into two sorts of 260 MB and have helped avoid a sort spill in the process. The key to this is the average row size, 23 bytes before the two tables are joined 89 bytes after. It is well documented that pushing filters into an execution plan as deeply as possible is highly desirable, in this example pushing sort iterators down into the plan via the CROSS APPLY has yielded a respectable performance gain at lows degrees of parallelism.

I set out writing this post with the expectation of being able to show the cost of the re-partition streams iterator, when in fact the facts proved the merits of sort iterator push down.

. . . to come in the next blog post, the answer to:


7 thoughts on “Anti Scale Patterns and The Repartition Streams Iterator In Row Mode

  1. Hi Chris,
    great blog post as always.
    Intersting thing is also how the sorts scales. SQL Server sorts data via Quick Sort (in memory) and via Merge Sort (on disk). Quick and Merge Sort complexity is O(n*log(n)). So, in this case if you’re sorting 11,870,752 rows in one piece, then the „cost” is 11870752 * log(11870752) = 193,369,660. CROSS APPLY pattern devides that one big sort on 9577 smaller sorts (becouse it’s the number of products). So, now the sort „cost” is 9577 * (11870752 / 9577) * log(11870752 / 9577) = 84,509,671 and this is aproximetly 2.3 times cheaper then sorting one big set.

    • Thanks for reading my blog and I’m glad you enjoy it. Referring to the tabular module / function call, the CPU time, sampled in ms across all cores for the sort in for the original query (sqlmin.dll::sortcmp) is 7986 ms compared 1,103 ms for the query with the CROSS APPLY , this is with a warm buffer cache and equates to an eight fold reduction in cpu time. This is the main reason I like using windows performance toolkit so much, because it allows these things to be quantified.

    • Hi Cody, I use windows performance toolkit, this comes as part of the windows assessment and planning toolkit. Firstly you need to obtain an ETW trace, there is a GUI tool for doing this, but I prefer using the xperf command line, to investigate where you CPU time is going issue the following command xperf.exe -on PROC_THREAD+LOADER+PROFILE -StackWalk Profile, run your workload then issues an xperf -d mytrace.etl. Load this into the windows performance analyzer gui, load your public symbols, there is no need to change the path they are downloaded from, it should point to the Microsoft public symbol server by default. Then drag the “CPU analysis (sampled)” graph onto the analysis canvass, right click on the column headings in the table, select function, module and call stack and you are good to go . . .

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s