A lot of the work I have done over the last year has involved placing stress on the database engine via singleton inserts, using a stored procedure that inserts rows in a loop under the context of an explicit transaction. I use ostress to fire off multiple concurrent executions of the procedure, and the table being inserted into has either a clustered index key populated by NEWID() or a key crafted from @@SPID plus an offset, in order to avoid the “last page problem”, whereby performance is throttled by the PAGELATCH_EX waits created by inserts into the right-most leaf-level page of the index b-tree:
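The test harness looks something like this, a minimal sketch rather than my exact code: the table, procedure, and parameter names are illustrative, and whether the explicit transaction wraps each insert or the whole loop is an assumption here.

```sql
-- Illustrative singleton-insert harness; names and row counts are assumptions.
CREATE TABLE dbo.InsertTest
(
    Id      uniqueidentifier NOT NULL DEFAULT NEWID(), -- random key spreads inserts across the b-tree
    Payload char(100)        NOT NULL DEFAULT 'x',
    CONSTRAINT PK_InsertTest PRIMARY KEY CLUSTERED (Id)
);
GO

CREATE PROCEDURE dbo.usp_SingletonInserts
    @Rows int = 100000
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @i int = 0;
    WHILE @i < @Rows
    BEGIN
        -- Each insert runs as its own explicit transaction (an assumption;
        -- the transaction could equally wrap the whole loop).
        BEGIN TRANSACTION;
        INSERT INTO dbo.InsertTest DEFAULT VALUES;
        COMMIT TRANSACTION;
        SET @i += 1;
    END;
END;
GO
```

Something along the lines of `ostress -E -dStressDB -Q"EXEC dbo.usp_SingletonInserts" -n20 -r1` then fires off 20 concurrent executions of the procedure; check the exact flags against the RML Utilities documentation for your version.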
The default scheduler behaviour in SQL Server 2014 leads to some logical processors being completely idle (highlighted in red) while others are 100% utilised:
The solution to this problem is to use trace flag 8008, resulting in nice and even CPU utilisation across all logical processors:
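For reference, the trace flag can be enabled globally at runtime or, more durably, as an instance startup parameter; a sketch (test before using anywhere near production):

```sql
-- Enable trace flag 8008 globally for the running instance
DBCC TRACEON (8008, -1);

-- Confirm it is active
DBCC TRACESTATUS (8008);
```

Alternatively, add `-T8008` to the instance startup parameters so the flag survives a service restart.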
Scheduling At Extreme Scale
My good friend Thomas Kejser has (as you might expect) well and truly put the SQL Server scheduler under immense pressure, during an exercise he conducted with his then Microsoft colleague Henk Van Der Valk. The objective of the exercise was to bulk load a terabyte of data in 10 minutes; Henk has a blog post on the subject which can be found here. At the time of writing this post I only have access to a dual-socket machine with ten cores per socket, whereas the machine Henk and Thomas used was a 96-core behemoth. One of the problems that Henk and Thomas ran into was that of hot SQLOS schedulers, and to get around this they elected to man-handle scheduling:
A soft NUMA node was created for each logical processor, and a script was created to take BCP jobs from a queue, connect to a soft NUMA node via its connection port, and schedule the job. I asked Thomas whether he had considered the use of trace flag 8008; the answer that came back was that in extreme cases trace flag 8008 cannot get around what he described as a race condition, which he illustrated with a nice analogy:
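For SQL Server 2014, soft NUMA nodes are configured in the registry and TCP ports can be mapped to nodes by suffixing a node affinity mask to the port number. A sketch for the first two nodes follows; the instance name, CPU masks, and port numbers are illustrative, so check the configuration against the official soft-NUMA documentation before use:

```
-- Registry (per instance): one key per soft NUMA node,
-- CPUMask selects the logical processor(s) for that node
HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQLServer\NodeConfiguration\Node0
    CPUMask (DWORD) = 0x01    -- logical processor 0
HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQLServer\NodeConfiguration\Node1
    CPUMask (DWORD) = 0x02    -- logical processor 1

-- SQL Server Configuration Manager, TCP/IP properties, TCP Port:
-- map a dedicated port to each node by appending the node mask in brackets
1500[0x1],1501[0x2]
```

Each BCP job then connects to the port bound to the node it should run on, which is how Henk and Thomas pinned work to specific schedulers.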
The Airport Security Check-In Queue Analogy
Consider the security checkpoints that most airports have these days: you will often find a member of airport security directing people to a specific queue. You walk up to this person, and he or she tells you which queue to join; however, that person has incomplete information, because the queues keep moving whilst he or she is directing people. The person controlling the ‘traffic’ into each queue cannot simultaneously watch the queues and direct people at the same time, so the problem has an inherent race condition. This is exacerbated by the fact that even if there is the same number of people in every queue, some queues will drain more slowly than others. To extend the analogy further, imagine someone at the head of a queue who walks through the metal detector and sets it off (this happened to me on the way to speak at the Join conference in Poland!). This is akin to a task being held up because it is spinning on a spinlock or waiting on a lock or latch. Because the security person directing the traffic has no foresight of some queues moving faster than others, they cannot pre-empt this situation. The bottom line is that this creates an inherent imbalance in the scheduler.
The more queues there are, the worse this situation becomes, which is why trace flag 8008 works for me on my dual-socket 20-core server, but did not for Henk and Thomas on the beast of a machine they used.
This brings us on to something new: the default scheduling algorithm in SQL Server 2016, as discussed in this CSS engineer’s blog post. Let’s see how this flies:
It appears that there is still value in using trace flag 8008 with SQL Server 2016. I should make the point clear that the new scheduling algorithm (according to the blog post) is intended to cater for mixed-workload instances, i.e. those that process both batch jobs and short transactions, not the specific workload that I have. This raises the question . . .
When Should I Worry About Default Scheduling Behaviour?
Before anyone runs away with the idea that they have exactly the same problem to solve as I have had: the singleton insert tests I have used are designed to stress the database engine; in fact, I have gone out of my way to do this. If your workload is largely composed of sessions with short transactions, encountering hot schedulers is probably not going to amount to a big deal, as things will even themselves out over the course of time.
What The Scheduler Should Be Aware Of
To the best of my knowledge the SQLOS scheduler is oblivious to hyper-threading, meaning it may see that a scheduler is idle whilst being oblivious to the fact that another scheduler on the same CPU core is maxing it out. Taking my old machine, which had two CPU sockets and six cores per socket, as an example, this is the relationship between logical processors (hyper-threads) and physical cores:
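This mapping can be approximated from the DMVs. A hedged sketch: SQLOS does not expose hyper-thread siblings directly, and the query below assumes Windows numbers siblings adjacently (logical processors 0 and 1 share core 0, and so on), which holds on most Intel systems but should be verified with a tool such as Sysinternals coreinfo:

```sql
-- Sketch: map SQLOS schedulers to logical processors and (assumed) physical cores.
-- The cpu_id / 2 derivation is an assumption about the OS's processor numbering.
SELECT scheduler_id,
       cpu_id,
       cpu_id / 2          AS assumed_physical_core,
       parent_node_id,
       [status],
       is_idle,
       current_tasks_count
FROM   sys.dm_os_schedulers
WHERE  [status] = 'VISIBLE ONLINE'
ORDER  BY cpu_id;
```

On the two-socket, six-core machine described above this would show 24 visible schedulers mapping onto 12 physical cores.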
The scheduler should therefore, in my humble opinion, do two things:
- When looking to schedule a new task, always pick a scheduler on an idle core first.
- Have some concept of the workload taking place at CPU-core level before scheduling a task to run on a logical processor belonging to a core that is already busy.
And Another Thing . . . The Four ms Quantum
I think that 4ms is ridiculously low for certain types of application; I would like to have the option to increase this up to 16ms, if not higher. The main use case I see for this is instances used for OLAP-style workloads and long-running batch-style processes.