There is a type of behavior in the database engine which undermines scalability: multiple threads contending for a single resource. Contention on the page free space bitmap is the example most people will be familiar with. This image illustrates the anti-scale pattern:
This pattern inhibits scalability due to shared state, access to which requires synchronization via a latch:
or a spinlock:
There are two potential solutions to this problem:
- Create more of the resource being contended on
In the case of the page free space bitmap we create more data files (a minimal sketch of this follows the list below).
- Avoid using the construct that contains the "anti-scale pattern"
In the case of the re-partition streams iterator, a CROSS APPLY (an Adam Machanic technique) can be used to coerce the optimizer into using a nested loop join instead of the re-partition streams iterator.
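For the first option, assuming the classic case of allocation bitmap contention in tempdb, equally sized data files can be added along these lines (the file name, path and sizes are illustrative only):

-- Illustrative only: add an extra, equally sized data file to tempdb so that
-- allocation (PFS/GAM/SGAM) contention is spread across more files.
ALTER DATABASE tempdb
ADD FILE ( NAME = N'tempdev2',
           FILENAME = N'T:\TempDB\tempdev2.ndf',
           SIZE = 4096MB,
           FILEGROWTH = 512MB );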
To test this we will require the AdventureWorks sample database and Adam Machanic's "make big adventure" script. Once we have our AdventureWorks database and the bigProduct and bigTransactionHistory tables, a query is required that results in an execution plan containing the re-partition streams iterator:
SELECT  p.ProductId
       ,p.ProductNumber
       ,p.ReorderPoint
       ,th.TransactionId
       ,RANK() OVER ( PARTITION BY p.ProductId
                      ORDER BY th.ActualCost DESC ) AS LineTotalRank
       ,RANK() OVER ( PARTITION BY p.ProductId
                      ORDER BY th.Quantity DESC )   AS OrderQtyRank
FROM    bigProduct AS p
JOIN    bigTransactionHistory AS th
ON      th.ProductId = p.ProductId
WHERE   p.ProductId BETWEEN 1001 AND 20001
and here two of them appear in the rightmost part of the plan:
According to the tooltip, both are using hash partitioning as the mechanism to redistribute rows between threads:
Rewriting the query with a CROSS APPLY results in the re-partition streams iterators being replaced by a nested loop join:
SELECT  p.ProductId
       ,p.ProductNumber
       ,p.ReorderPoint
       ,d.*
FROM   (SELECT TOP (2147483647)
                p0.ProductId
               ,p0.ProductNumber
               ,p0.ReorderPoint
        FROM   bigProduct AS p0
        WHERE  p0.ProductId BETWEEN 1001 AND 20001
       ) AS p
CROSS APPLY (SELECT  th.TransactionId
                    ,RANK() OVER ( PARTITION BY p.ProductId
                                   ORDER BY th.ActualCost DESC ) AS LineTotalRank
                    ,RANK() OVER ( PARTITION BY p.ProductId
                                   ORDER BY th.Quantity DESC )   AS OrderQtyRank
             FROM    bigTransactionHistory AS th
             WHERE   th.ProductId = p.ProductId) AS d
How do these two queries scale? Firstly, with a cold buffer cache:
and with a warm buffer cache:
At a degree of parallelism of 6 with a warm buffer cache, the query using the CROSS APPLY is over twice as fast as the original query.
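For reference, this is a sketch of how such a test run might be driven (the actual harness is not shown in the post); CHECKPOINT followed by DBCC DROPCLEANBUFFERS gives a cold buffer cache, and a MAXDOP hint pins the degree of parallelism:

-- Cold buffer cache run: flush dirty pages, then empty the buffer pool.
-- Omit these two statements for a warm buffer cache run.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- Execute the query under test at a fixed degree of parallelism, e.g. 6.
-- (The original or CROSS APPLY query from above goes in place of the ellipsis.)
SELECT ...
OPTION (MAXDOP 6);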
The theory is that replacing the re-partition streams exchange iterator with a nested loop join leads to better query scalability. This is the part of the stack associated with the iterator:

Note that the stack displays the spinlock ( X_PACKET_LIST ) used to synchronize access to the CX packet list structure. However, the CPU time consumed by this is a fraction of that used by the database engine at the time the trace was captured: 82,546 ms for the entire database engine versus 100 ms for the re-partition streams iterator.
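Spinlock pressure can also be observed from inside the database engine via sys.dm_os_spinlock_stats; the sketch below samples anything with "PACKET" in its name before and after a run. The exact name the spinlock is reported under is an assumption and may vary between builds:

-- Sketch: capture these counters, run the query, capture them again and diff.
-- The LIKE filter is a guess at how exchange packet spinlocks are named here.
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM   sys.dm_os_spinlock_stats
WHERE  name LIKE '%PACKET%'
ORDER  BY spins DESC;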
Why Is The Re-Written Query More Efficient?
The top ten most CPU intensive functions in the row mode database engine runtime for the original query are:
and this is the picture for the re-written query with the CROSS APPLY:
Locating sortcmp in the call stack and then inverting the stack reveals all the functions that called sqlmin!sortcmp; unsurprisingly, it is called by the sort iterator:
Let's now drill down into the sorts in the two queries' execution plans, firstly in the plan for the original query:
The sort that accounts for 42.5% of the plan cost has sorted 11,870,752 rows with an estimated row size of 89 bytes, totaling 1,008 MB of data. The next sort along accounts for 42.6% of the query's total cost, has a spill warning, and has sorted the same number of rows with the exact same estimated row size.
In the plan for the re-written query:
The sort that immediately follows the index seek has sorted 11,870,752 rows, each with an estimated row size of 23 bytes, a total of 260 MB of data. The next sort along has sorted the exact same number of rows with the exact same estimated row size, also totaling 260 MB:
Takeaway
The CROSS APPLY is a T-SQL language construct of immense power. In this example, by forcing the sorts to take place on the rows retrieved from bigTransactionHistory before the join to the bigProduct table, two sorts of 1,008 MB have been turned into two sorts of 260 MB, avoiding a sort spill in the process. The key to this is the average row size: 23 bytes before the two tables are joined, 89 bytes after. It is well documented that pushing filters as deeply into an execution plan as possible is highly desirable; in this example, pushing the sort iterators down into the plan via the CROSS APPLY has yielded a respectable performance gain at low degrees of parallelism.
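As a quick back-of-the-envelope check, those data volumes follow directly from the row counts and estimated row sizes quoted above:

-- 11,870,752 rows x 89 bytes ~ 1,008 MB (sorts in the original plan)
-- 11,870,752 rows x 23 bytes ~   260 MB (sorts in the CROSS APPLY plan)
SELECT  11870752 * 89 / 1048576.0 AS original_sort_mb   -- ~1007.6
       ,11870752 * 23 / 1048576.0 AS rewritten_sort_mb; -- ~260.4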
I set out writing this post expecting to be able to show the cost of the re-partition streams iterator; instead, the evidence demonstrated the merits of sort iterator push-down.
. . . to come in the next blog post, the answer to:
Excellent stuff!
Hi Chris,
great blog post as always.
An interesting thing is also how the sorts scale. SQL Server sorts data via quicksort (in memory) and via merge sort (on disk). Quicksort and merge sort complexity is O(n*log(n)). So, in this case, if you're sorting 11,870,752 rows in one piece, the "cost" is 11870752 * log(11870752) = 193,369,660. The CROSS APPLY pattern divides that one big sort into 9,577 smaller sorts (because that is the number of products). So now the sort "cost" is 9577 * (11870752 / 9577) * log(11870752 / 9577) = 84,509,671, which is approximately 2.3 times cheaper than sorting one big set.
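The figures above correspond to the natural logarithm; as a sketch, they can be reproduced in T-SQL, where LOG() returns the natural log by default:

-- Comparative sort "costs" from the comment above: n * ln(n) versus the
-- partitioned equivalent; the absolute numbers only matter relative to each other.
SELECT  11870752 * LOG(11870752)               AS single_sort_cost      -- ~193.4 million
       ,11870752 * LOG(11870752 / 9577.0)      AS partitioned_sort_cost -- ~84.5 million
       ,LOG(11870752) / LOG(11870752 / 9577.0) AS cost_ratio;           -- ~2.3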
Thanks for reading my blog and I'm glad you enjoy it. Referring to the tabular module / function call view, the CPU time sampled in ms across all cores for the sort in the original query (sqlmin.dll::sortcmp) is 7,986 ms compared to 1,103 ms for the query with the CROSS APPLY. This is with a warm buffer cache and equates to roughly a seven-fold reduction in CPU time. This is the main reason I like using the Windows Performance Toolkit so much: it allows these things to be quantified.
How did you generate that stack display?
Hi Cody, I use the Windows Performance Toolkit, which comes as part of the Windows Assessment and Deployment Kit. Firstly you need to obtain an ETW trace. There is a GUI tool for doing this, but I prefer using the xperf command line. To investigate where your CPU time is going, issue the following command: xperf.exe -on PROC_THREAD+LOADER+PROFILE -StackWalk Profile, run your workload, then issue xperf -d mytrace.etl. Load the trace into the Windows Performance Analyzer GUI and load your public symbols; there is no need to change the path they are downloaded from, it should point to the Microsoft public symbol server by default. Then drag the "CPU Usage (Sampled)" graph onto the analysis canvas, right click on the column headings in the table, select function, module and call stack, and you are good to go . . .
You can check out this Defrag Tools show on CPU analysis from Channel 9; it tells you all you need to know: https://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-42-WPT-CPU-Analysis.