The SQL Server 2016 Scheduler and Trace Flag 8008

A lot of the work I have done over the last year has involved placing stress on the database engine via singleton inserts using a stored procedure that inserts rows in a loop under the context of explicit transaction. I use ostress to fire off multiple executions of the procedure and the table inserted into either has a clustered index key populated by NEWID() or a key crafted from @@SPID + offset in order to avoid the “Last page problem” whereby performance is throttled by the PAGELATCH_EX waits created from inserts into the . . . right most leaf node page in the index b-tree:

last page

The default scheduler behaviour in SQL Server 2014 leads to some logical processors which are completely idle (highlighted in red) and others that are 100% utilised:default scheduling

The solution to this problem is to use trace flag T8008, resulting in nice and even CPU utilisation across all logical processors:TF8008

Scheduling At Extreme Scale

My good friend Thomas Kejser (as you might expect) has well and truly put the SQL Server scheduler under immense pressure, this was during an exercise he conducted with his then Microsoft colleague Henk Van Der Valk. The objective of the exercise was to bulk load a Terabyte of data in 10 minutes, Henk has a blog post on the subject which can be found here. At the time of writing I this post I only have access to a dual CPU socket machine with ten cores per socket, the machine Henk and Thomas used for a 96 core behemoth. One of the problems that Henk and Thomas ran into was that of hot SQLOS schedulers, to get around this they elected to man-handle scheduling:

man handling scheduling

A soft NUMA node was created for each logical process and script was created to take BCP jobs from a queue connect to a soft NUMA nodes via its connection port and schedule the job. I asked Thomas if he had considered the use of trace flag TF8008, the answer that came back was that in extreme cases trace flag 8008 cannot get around what he described as a race condition using a nice analogy:

The Airport Security Check-In Queue Analogy

Consider security checkpoints that most airports have these days, you will often find a member of airport security directing people to a specific queue. So, you walk up to this person, and he or she will tell you which queue to join, however, that person has incomplete information due to the queue behind them moving while he/she is directing people. The person controlling the ‘Traffic’ into each queue cannot simultaneously look at the queue behind him/her and direct people at the same time, this problem has an inherent race condition. This problem is exacerbated by the fact that even if there is the same amount of people in every queue, some of the queues will drain slower than others. To extend the analogy further, imagine someone who is at the head of the queue who walks through the metal detector and sets it off (this happened to me on the way to speak at Join conference in Poland!). This is akin to a task being held up because it is spinning on a spin lock or waiting on a lock or latch. Because the security person directing the traffic has no foresight of some of the queues moving faster than others they cannot pre-empt this situation. The bottom line is that this creates an inherent imbalance in the scheduler.

The more queues there are, the worse this situation becomes, hence why trace flag TF8008 works for me on my dual socket 20 core server, but not for Henk and Thomas on the beast of a machine they used.

This brings us on to something new, the new default scheduling algorithm in SQL Server 2016 as discussed in this CSS engineers blog post, lets see how this flies:

SQL2016 scheduling

It appears that there is still value in using trace flag 8008 with SQL Server 2016. I should make the point clear that the new scheduling algorithm (according to the blog post) is intended to cater for mixed workloads instances, i.e. those that process both batch jobs and short transactions not the specific workload that I have. This begs the question . . .

When Should I Worry About Default Scheduling Behaviour ?

Before anyone runs away with the idea that they have the exact same problem to solve as I have had, the singleton insert tests I have used are designed to stress the database engine, in fact I have gone out of my way to do this. If your workload is largely composed of sessions with short transactions, the issue of encountering hot schedulers is probably not going to amount to a big deal, as things will even themselves out over the course of time.

What The Scheduler Should Be Aware Of

To the best of my knowledge the SQL OS scheduler is oblivious to hyper-threading, meaning it may see that a scheduler is idle but be oblivious to the fact that another scheduler may be using the same the same CPU core and maxing it out. Taking my old machine which had two CPU sockets and six cores per socket as an example, this is the relationship between logical processors (hyper-threads) and physical cores:
better context

the scheduler should therefore in my humble opinion do two things:

  1. When looking to schedule a new task always pick a scheduler on an idle core first.
  2. Have some concept of the workload taking place at CPU core level before scheduling a task to run on logical processor on a core that is already busy.

And Another Thing . . . The Four ms Quantum

I think that 4ms is ridiculously low for certain types of application, I would like to have the option to increase this up to 16 ms if not higher, the main use case I see for this is instances used for OLAP style workloads and long running batch style processes.








5 thoughts on “The SQL Server 2016 Scheduler and Trace Flag 8008

  1. I’ve been working almost exclusively in SQL Server VMware vms for a while (even when, for some work, it would be much easier and cleaner on a physical server).

    The workload I work with has many large scale queries, but some small scale queries as well. Still, even in SQL Server 2016 I saw some hot schedulers/vcpus with others remaining idle. I applied trace flag 8008, expecting to see a gain. My testing results were inconclusive.

    I was really puzzled by this until a little while ago. A big part of the reason I was getting inconclusive results when testing this in VMware: an even larger scheduling kerfuffle was swallowing the difference I was expecting to see 🙂

    “A Decade of Wasted Cores”

    There’s another aspect of virtualization that sometimes swallows performance behavior I expect to observe – I hope to blog about it soon…

  2. Why is SQL Server using its own scheduling system at all? It seems to cause so much trouble. I see nothing in it that the OS can’t handle.

    • The windows server operating system is a “Jack of all trades and master of none”. Dr Michael Stonebraker released a paper on non-pre-emptive scheduling which Bob Ward referenced on his Pass session on SQL OS a few years back when talking about why SQL Server performs non-pre-emptive scheduling. The short answer to why SQL Server performs its own scheduling is because SQL Server uses resources in the context of a database engine such as latches and spin-locks that the operating system has non concept of. Consider a couple of scenarios, take a parallel query for example, without SQL OS all the operating system would see is a bunch of unrelated threads, because SQL OS does have some concept of the nuances of a database engine workload it will attempt to schedule these threads on the same NUMA node / CPU socket (even though this can be more nuanced with SQL Server 2016). Another example, spin locking, the operating system has concept of the resource that a task’s thread has been spinning on how long it has been spinning and when back off should take place.

    • If I refer you to slide 8 of this deck I produced for pure insert throughput NEWSEQUENTIALID() is not that much better than using IDENTITY(). I don’t know if you are an avid reader of my blog or this is the first post of mine you have seen, however the general format of my posts is to state:

      1. What I’m setting out to achieve
      2. What my test environment is
      3. How I create my test data
      4. How I run my tests
      5.How my tests behave when I change different things

      The link to the stackoverflow article you have provided goes into a lot of pontificating around the subject without actually presenting any hard evidence as to the fact that for sheer concurrent multi-threaded insert throughput with low latency storage and a clustered index, NEWID() is far and away superior to NEWSEQUENTIALID().

      I would invite you to perform some **** actual testing **** based on this specific test case and get back to me . . .

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s