Following some feedback from my last blog post:
. . . what is inhibiting scalability ?
As the degree of parallelism is increased, CPU saturation should be achievable unless some contended resource is being waited on, this appears to be not the case with the test query:
Note: all elapsed times in this blog post are captured from a warm buffer cache with management studio set to discard results from the results grid after execution.
the real question IMHO is why the scale goes flat already at 4 cores
With hyper-threading enabled; “At 4 cores” corresponds to a degree of parallelism (DOP) of 8, assuming two hyper-threads per physical core are used.
This is the wait activity for the query run with a DOP of 8:
Can we get more fine grained wait stats ?
The contents of the resource_description column in sys.dm_os_waiting_tasks captures wait activity at execution plan node level, 59 is hard coded into the query as this is the session the query with the CROSS APPLY is running in:
This reveals that CXPACKET wait activity on node 2 is being encountered:
node two is a distribute streams iterator:
Given that CXPACKET is the dominant wait type, lets look at CX Packet wait time as the degree of parallelism is increased:
As the elapsed time plateaus off, CPU time and CX Packet wait time continues to rise. This pattern is similar to that observed with spin lock spins as the number of threads processing an OLTP workload increases.
What exactly is a CX packet wait ?
There are three parallel exchange iterators available to the optimizer:
- Gather streams
- Re-partition streams
- Distribute streams
this is what the distribute streams iterator looks like conceptually:
The call stack for the side of the iterator into which cx packets are appended looks like this:
and the side that consumes the packets has this call stack:
The memory required by parallel exchange iterators is used for the allocation of the CXPacketList first in first out queue, packets are placed onto this structure and removed from it under the protection of the X_PACKET_LIST spinlock.
To use a term coined by Paul White, the SQL Server execution engine uses a “Demand iterator pattern”, data is pulled through the plan from the root node downwards and control is exerted thought the plan from the root node of the plan down. A CX Packet wait takes place when a thread is waiting for a packet of data, CX Packet waits cannot be avoided for queries with parallel plans, however the issue can be exacerbated by three things:
- Data skew across parallel threads
- Hot schedulers
- A degree of parallelism that is too high
What can windows performance toolkit tell is ?
Lets dig into what this iterator is doing using windows performance toolkit, as we are interested in the context switching between the different threads used to execute the query, this is the xperf command line required for this:
xperf -on PROC_THREAD+LOADER+PROFILE+INTERRUPT+DPC+DISPATCHER+CSWITCH -stackwalk Profile+CSwitch+ReadyThread –maxbuffers 1024
after executing the query with a degree of parallelism of 8, issue xperf -d waitanalysis.etl to create the event trace file. Once the trace file is loaded into windows performance analyzer it is the CPU Usage precise data we are interested in:
Drag this onto an analysis canvas, there are several columns in the table if interest, in particular the NewThreadStack is the stack of a thread waiting to be scheduled and the ReadyThreadStack is the stack of the thread which wakes it up:
The metric wait time (in micro seconds) represents the amount of time a thread has spent waiting to be woken up by an event triggered by another thread. The CPU precise usage information allows the threads to be ordered is descending wait time order. An appreciation for the the way windows schedules threads is required in order to understand two key performance metrics:
- Ready time
- Wait time
this is the windows threading model at an extremely high level, the image below comes from this article:
Elements of this should look very familiar, a thread’s ready state is similar to the state of a task thread when it in the runnable queue waiting to be executed by a scheduler and the thread wait state is similar to the state of a task thread when its on the waiting list. Because we are only hitting a peek 17.5% utilisation suggests that ready time is not a factor in the scalability brick wall we are hitting. The tabular view of the CPU usage precise data when sorted in descending CPU time (ms – sum) order looks like this:
Lets stack walk the top thread, the total CPU usage for SQL Server across all logical processors during the sampling period is 27,741 ms:
Line 22 contains the call sqllang.dll!CTds74::SendRowImpl, I suspect the Tds is short for tabular data stream, implying that this thread is heavily involved in rendering the result set. Note line 36, also highlighted, the thread has burned 782,358 ms of CPU time and the context switch has burned 306,617 ms, 40% of the threads CPU time has been burned up by a context switch.
It is both well known and a well documented fact that context switches are expensive, however, using windows performance toolkit has allowed us to quantify this.
What About The Worker Threads ?
The worker threads appear to be mainly waiting on a parallel exchange iterator to provide packets, evident from the call to sqlmin!CXPipe::Pull in the stack for the waiting thread and the function calls in the stack of the thread that wakes it up:
Information from the CPU usage (precise) data allows this data to be represented in graph form:
No threads relating to the execution of the query with a degree of parallelism of 2 show up, which is why the graph starts at a degree of parallelism of 4. For brevity the graph does not go all the way up to a degree of parallelism of 40. CPU time is the average CPU time per worker thread and wait time is the average wait time per worker thread. As it has already been established that the highest CPU utilisation achieved is 16% (across the whole box) ready time can be disregarded as the cause of the poor scalability being observed.
What Does The Graph Tells Us ?
My hope was as average CPU utilisation per thread decreased, wait time would increase, but the wait time data shows no obvious pattern, the conclusion from this is that the waits are being experienced on something that windows performance toolkit is not capturing. Lets try a different approach . . .
Appending a number after the GO at the end of the statement allows it to be executed in a loop from within management studio. Dump files (mini dumps will suffice) can be created whilst the query is running, this is another way of capturing thread stack traces, Thomas Kejser has produced a good article on the subject which can be found here.
There is one commonly recurring stack that appears across the dump files created, it is this:
windbg presents call stacks in a subtly different way to windows performance analyzer, i.e. bottom up and not top down.
The call stacks caught in both the event trace for windows file (via xperf) and the mini dump both suggest that a call to sqlmin!CXPipe::Pull is waiting for a packet to be made available from a parallel exchange iterator. Also the time attributed to waits experienced by sqlmin!CXPipe::Pull is not being captured in its entirety by windows performance toolkit.
Edit 01/12/2015 – Conclusion
In my haste when writing this blog originally I did not explicitly answer the question posed at the top of the post. The scalability of the query is inhibited by the distribute streams iterator, or more precisely by the fact that there are multiple threads contending on a single thread for packets. Removing the top clause from the query removes the distribute streams iterator from the plan, however scalability still goes flat after a degree of parallelism of 8, my suspicion being that the bottleneck becomes the gather streams iterator, . . . which begs the question posed in final thought #2 below.
#1 Execution plan “Shape”
There is obviously a relationship between execution plan shape and scalability, this post has barely scratched the surface of the subject.
#2 The bottleneck of rendering the result set
Every execution plan irrespective of the number of worker threads it uses has a serial region. The parallel insert feature has existed since SQL Server 2014 and is furnished through the SELECT INTO statement, SQL Server 2016 lifts this restriction and allows INSERT SELECT statements to be performed across more than one worker thread. If the bottleneck in our query is the gather streams iterator, what happens if the result set is being rendered into a partitioned table with each worker thread’s data going into its own partition, does this result in better scalability ?.
#3 Modern CPU architectures and plan iterators
Most people will be aware of Moore’s law:
There is another phenomenon which is not so well documented and publicized and that is the addition of a new level to the CPU cache hierarchy every ten years, this is because trips out to main memory are so expensive. The fact that the CPU cache exists in the first place is only part of the story, the types of data structures used and the way in which they are accessed has big performance and scalability ramifications. Hash joins, any form of sort and index seeks all cause random memory access, this makes it difficult for the pre-fetcher to its job as it cannot preempt what to bring into the CPU cache from main memory next if the memory access pattern is random.