There is a type of behavior in the database engine which undermines scalability, this is when multiple threads contend for a single resource, contention on the page free space bit map is the example that most people will be familiar with. This image illustrates this anti scale pattern:
This pattern inhibits scalability due to shared state, access to which requires synchronization via a latch:
or a spinlock:
- Create more of the resource being contended on
In the case of the page free space bit map we create more data files.
- Avoid using the construct that contains the “Anti scale pattern”
In the case of the re-partition streams iterator, a CROSS APPLY (an Adam Machanic technique) can be used to coerce the optimizer into using a nested loop join instead of the re-partition stream iterator. To test this we will require the AdventureWorks sample database and Adam Machanic’s make big adventure script, once we have our AdventureWorks database, bigProduct and bigTransactionHistory tables, a query is required that results in an execution plan containing the re-partition streams iterator:
SELECT p.ProductId ,p.ProductNumber ,p.ReorderPoint ,th.TransactionId ,RANK() OVER ( PARTITION BY p.ProductId ORDER BY th.ActualCost DESC ) AS LineTotalRank ,RANK() OVER ( PARTITION BY p.ProductId ORDER BY th.Quantity DESC ) AS OrderQtyRank FROM bigProduct AS p JOIN bigTransactionHistory AS th ON th.ProductId = p.ProductId WHERE p.ProductId BETWEEN 1001 AND 20001
and here two of them are in the right most part of the plan:
According to the tool tip both are using hash as the mechanism to redistribute rows between threads:
rewriting the query with a cross apply results in the re-partition streams iterators being replaced by a nested loop join:
SELECT p.ProductId ,p.ProductNumber ,p.ReorderPoint ,d.* FROM (SELECT TOP (2147483647) p0.ProductId ,p0.ProductNumber ,p0.ReorderPoint FROM bigProduct AS p0 WHERE p0.ProductId BETWEEN 1001 AND 20001 ) AS p CROSS APPLY (SELECT th.TransactionId ,RANK() OVER ( PARTITION BY p.ProductId ORDER BY th.ActualCost DESC ) AS LineTotalRank ,RANK() OVER ( PARTITION BY p.ProductId ORDER BY th.Quantity DESC ) AS OrderQtyRank FROM bigTransactionHistory AS th WHERE th.ProductId = p.ProductId) AS d
How do these two queries scale ?, firstly with a cold buffer cache:
and with a warm buffer cache:
At a degree of parallelism of 6 and a warm buffer cache the query using the cross apply is over twice as fast as the original query.
The theory is that replacing the re-partition streams exchange iterator with a nested loop join leads to better query scalability, this is the part of the stack associated with this iterator:Note that the stack displays the spin lock ( X_PACKET_LIST ) which is used to synchronize access to the CX packet list structure, however the CPU time consumed by this is a fraction of that used by the database engine at the time the trace was captured; 82,546 ms for the entire database engine and 100 for the re-partition streams iterator.
Why Is The Re-Written Query More Efficient ?
The top ten most CPU intensive function for the row mode database engine run time are:
and this is the picture for the re-written query with the CROSS APPLY:
Locating sortcmp in the call stack and then reversing it reveals all the functions that called sqlmin!sortcmp, unsurprisingly this has been called by the sort iterator:
Lets now drill down into the sorts in the two queries execution plans, firstly in the plan for the original query:
the sort that accounts for 42.5% of the plan cost has sorted 11,870,752 rows with an estimated row size of 89 bytes, totaling 1,008 MB of data, the next sort along accounts for 42.6% of the queries total cost and has a spill warning and has sorted the same number of rows with exact same average row size.
In the plan for the re-written query:
the sort that immediately follows the index seek has sorted 11,870,752 rows, each with an estimated row size of 23 bytes, a total of 260 MB of data in total, the next sort along has sorted the exact same number of rows with the exact same estimated row size, totaling 260 MB in size also:
The CROSS APPLY is a T-SQL language construct of immense power, in this example by forcing the sort to take place on the rows retrieved from bigTransactionHistory before the join to the bigProduct table, two sorts of 1,008 MB have been turned into two sorts of 260 MB and have helped avoid a sort spill in the process. The key to this is the average row size, 23 bytes before the two tables are joined 89 bytes after. It is well documented that pushing filters into an execution plan as deeply as possible is highly desirable, in this example pushing sort iterators down into the plan via the CROSS APPLY has yielded a respectable performance gain at lows degrees of parallelism.
I set out writing this post with the expectation of being able to show the cost of the re-partition streams iterator, when in fact the facts proved the merits of sort iterator push down.
. . . to come in the next blog post, the answer to: