Using The Right Tool For The Right Job: Stack Walking With Windows Performance Toolkit

This post will cover how to navigate the call stack in a trace captured using the Windows Performance Toolkit (WPT). First, let me mention something prevalent across just about all event-based profiling tools.


As far as I am aware, all low-level event-based profilers use sampling, the reason being that they would bring the system being profiled to its knees if every single event were captured. The default sample rate for the Windows Performance Toolkit is 1 KHz; this can be raised to a maximum of 8 KHz using -SetProfInt 1221. The other thing to watch out for is events being dropped because the tool cannot flush them to disk fast enough. Events are captured in buffers; by default there are 64 of them, each 64 KB in size. The number of buffers can be increased to 1,024 and the buffer size increased to 1,024 KB.
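As a rough sketch of how these knobs fit together, a session raising the sample rate and buffer settings to the values mentioned above might be started like this (this assumes an elevated prompt with WPT installed; the kernel flags shown are a common minimal set for CPU sampling, not the only valid choice):

```shell
# Start a kernel trace sampling at ~8 KHz with enlarged buffers to reduce
# the chance of dropped events. BufferSize is in KB.
xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile -SetProfInt 1221 -BufferSize 1024 -MinBuffers 1024 -MaxBuffers 1024
```

If xperf reports dropped events when the session is stopped, the buffer settings above are the first thing to revisit.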

When using WPT to profile the execution of different workloads for comparison purposes it is important that

a) xperf is started before the workload executes and stopped when it finishes

b) the sampling rate is the same for both profiling sessions, and no events are dropped if possible.

If you follow these two simple guidelines, the fact that events are captured via a process of sampling should not matter.
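Guideline a) amounts to bracketing the workload with the trace session. A hedged sketch, in which the database name and workload script are hypothetical placeholders:

```shell
# Start the trace, run the workload, then stop and merge the trace.
# Assumes an elevated prompt with WPT and the SQL Server tools installed.
xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile
sqlcmd -S . -d MyDatabase -i workload.sql   # hypothetical workload under test
xperf -d workload_trace.etl                 # stop the session, write the ETL file
```

Because the session starts immediately before the workload and stops immediately after it, the resulting .etl file contains samples for the workload and little else.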

Comparative Analysis

The Windows Performance Toolkit provides a feature called “Comparative Analysis”, which allows you to compare two Event Tracing for Windows (ETW) files. The idea behind using this feature is to capture the difference in sampled CPU time between the two ETW traces. It has one major limitation, however: from what I can tell, both stack traces need to be identical in terms of the code paths captured for this to work.

Where To Start?

This all depends on what you are looking for. If you simply want to get some insight into what the database engine is doing, you can dive straight into walking the stack. However, if you want to gain insights into where your CPU cycles are going, I would recommend you look at the ‘Computation’ data in tabular form, specifically at the modules that have consumed the most CPU time:


It should come as no surprise that most of the CPU time has been expended in sqlmin.dll; this DLL encapsulates the storage engine and the database engine runtime. Let's drill down into this by including functions in the table:


The Windows Performance Analyzer excerpts I’m using relate to a recent blog post in which I compared what the batch engine was doing for the same query, using the exact same data, with SQL Server 2014 SP1 versus SQL Server 2016 CTP 2:

SELECT   SUM(BigColumn1)
FROM     [dbo].[BigDataColumnStore] b
JOIN     [dbo].[DimDate] d
ON       b.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

This is what the most CPU intensive functions look like for SQL Server 2014 SP1:


Let me highlight two areas of interest in the functions listed in descending sampled CPU order from the SQL Server 2016 CTP 2 ETW trace that do not appear in its 2014 SP1 counterpart:


To gain some insight into what these functions do, we need to include the stack in the table and search for these functions:


Navigating Your Way Around The Call Stack


99.9% of the readers of my blog will not be familiar with the internals of the database engine at the function-call level, nor have access to the development team like an MVP does. However, there is a very easy way to navigate the call stack for the execution of a statement: look for iterator entry points, which act almost like landmarks.

Let's take the execution plan for the statement executed with SQL Server 2016 CTP 2 and identify the function calls which represent the iterator entry points in the call stack.

This is the call stack when viewed through Windows Performance Analyzer; I’ve highlighted in blue the function calls which relate to everything in the plan except the clustered index scan on the dimension table:

Let's put the execution plan and the weights (sampled CPU time) together:

This is the plan for the same query running under SQL Server 2014 SP1 with sampled CPU time added:


The sampled CPU time (weight) is cumulative as we walk up the call stack. In the example immediately above, the figure of 102,629.26 for the hash aggregate that follows the column store scan is obtained by subtracting the weight on the line containing sqlmin!CBpQScanColumnStores::BpGetNextBatch from that on the line with the function call sqlmin!CBpQScanHashAggNew::BpGetNextBatch.
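The subtraction can be sketched as below. The two weight values here are hypothetical, chosen only so that the difference matches the 102,629.26 figure quoted above; in practice you would read both numbers off the stack view in WPA:

```shell
# Exclusive weight of an iterator = weight on its own stack line
# minus the weight on its child's stack line.
parent=150000.00   # hypothetical weight on sqlmin!CBpQScanHashAggNew::BpGetNextBatch
child=47370.74     # hypothetical weight on sqlmin!CBpQScanColumnStores::BpGetNextBatch
awk -v p="$parent" -v c="$child" 'BEGIN { printf "%.2f\n", p - c }'
```

Repeating this parent-minus-child subtraction at each iterator entry point turns the cumulative weights into per-iterator CPU figures.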

Takeaway #1: We can quantify the (sampled) CPU time of every iterator in an execution plan.

What happens to the sampled CPU time when the column store object pool is warm? Note that the column store we are scanning is 28 GB in size and the server has 32 GB of memory:


These weights (sampled CPU times) were taken using SQL Server 2016 CTP 2.

Takeaway #2: With a warm column store object pool, the hash aggregate that immediately follows the column store scan is the most CPU-intensive iterator in the execution plan. It's my belief that for any vendor producing a column store engine, the hash join mechanism bubbles to the top of the list of performance bottlenecks once scanning a column store as efficiently as possible has been achieved.

Takeaway #3: One of the blind spots of the optimizer has always been the fact that it assumes the buffer cache is cold; this situation does not appear to have changed with the column store object pool.
