Super Scaling SQL Server with Virtualization

A question I received following my pre-conference training day at SQL Bits and during my post-conference training day in Poland was how my material relates to SQL Server running in a virtualized environment. As there are figures floating around suggesting that 80%+ of all SQL Server instances in production run on virtual machines, this seems a perfectly reasonable question to ask. I mentioned to one of attendees of my training day in  Poland that the answer is worthy of blog post in its own right, so here it is. The short answer is that the stuff I talk about is still applicable, however there are some key issues around the configuration of virtual machines that need to be addressed:

Virtual CPU To Physical CPU Core Ratios

CPU cache is king !, by this I mean that a huge performance penalty is paid whenever we have to make any trips out to main memory:

CPU Cache Access in CPU Cycles

The extra dimension to this that virtualisation adds is that we may be sharing physical CPU cores with other virtual machines, in light of this, when pushing the database engine hard I would recommend that at the very least one virtual CPU is tied to a hyper-thread and that virtual processors are ‘Affinitized’ to specific CPU cores. One virtual CPU per physical CPU core is better still.

Also consider spinlock pressure, there are some spinlocks we can prevent activity on period and others which we have to resign ourselves to taking a hit on:

spinlocks hit

With heavy and sustained OLTP workloads we will always run into spinlocks, there are “Usual suspect” style spinlocks for OLTP workloads we will encounter with the legacy engine:

old world spins

In the new world of the in-memory OLTP engine these spinlocks are not so prominent:

new world spins

However, it is still possible for serious spinlock pressure to be created in the in-memory OLTP engine. This is the best case scenario for pushing messages into a queue using the LMax disruptor pattern, my current hardware and the in-memory OLTP engine:

lmax in memory

When performance drops off a cliff 41% of the total CPU sampled by windows performance tool-kit is burned  by one spinlock, no, you have not misread this, contrary to what Microsoft’s marketing people are saying, the in-memory OLTP engine is emphatically not spin-lock free:

cmed hash set

Because of this, you may have to rely on the brute force solution of being able to throw as many CPU cycles at this problem as possible, i.e. you need to be able to acquire and release a spinlock as fast as possible without the hyper visor switching out your virtual machines virtual CPUs.

Another Good Reason For vCPU To Physical Core Affinity

Microsoft have conducted OLTP benchmarks using the legacy engine in which 40% of the servers total CPU capacity has been expended by spins on the LOGCACHE_ACCESS spinlock, to recap this is how it fits into the grand scheme of the logging infrastructure:


The first thing you should do is to address the basic things that lead to log generation in the first place as per Paul Randall’s trimming the transaction log fat blog post series, after that the next thing to do is to give the log writer its own CPU core by taking core 0 out of the CPU affinity mask, this is the impact this can have:

aff mask trick

Note that this advice applies to SQL Server 2014, SQL Server 2016 starts up multiple log writers when running on a server with multiple  CPU sockets, plus based on my own testing it can lead to poorer performance than SQL Server 2014 under certain conditions, this is due to some kinks which will hopefully be ironed out in the RTM version of SQL Server 2016.

Respecting NUMA Boundaries

Consider the scalability of the legacy database engine as we cross the NUMA boundary, lets look at a row mode hash join to begin with:

row mode hash join scalability

and now a batch mode hash join:

batch mode hash join scalability

and an OLTP workload using the legacy engine:

otlp scalability

There is one very particular door I’m knocking on with all of this and that is despite the fact that VMWare and Hyper-V are both NUMA aware, I would configure my virtual machines to use one CPU sockets worth of cores at the very most. There are some types of workload which will scale well across two sockets:

IM-otlp scalability

If your virtual machine needs to be configured such that it consumes all the CPU resources of an entire blade or server, I would question what value of using virtualisation in the first place.

Storage Considerations

Virtualization is notorious for creating the “Io blender effect”:

io blender

The hyper-visor multi-plex’s IO such that your SAN can ended being hit with the small IO size random IO that they hate. One particular real world example of this I encountered springs to mind, an IT department had a bunch of data marts running on servers with direct attached storage, they consolidated these into a single data warehouse which ran on a virtual machine that used storage allocated from a SAN, a spinning disk based one at that:

DAS versus SAN

Despite the infrastructure people thinking thinking that adding up all the IOPS the direct storage arrays delivered and thinking that this was what their new SAN need to deliver, the SAN ended up being crippled by the IO blender effect. Note that a good all flash array / SAN should not be susceptible to this problem to the same degree.

The Future ?

Another topic that is easily worth a blog post in its own right is that of Microsoft’s scale out file server and SMB 3.0, this allows a storage infrastructure to be built uses a low latency server memory to server memory (RDMA) fabric, this article provides a good entry point into the subject. The salient point of all of this is the physical transports associated with RDMA provide massive bandwidth and low latency to the degree that the storage bottleneck is often the throughput and latency of the physical media the data resides on. Thanks goes to Allan Hirt (@SQLHA) for sharing this information with me.

Historically virtualization has been about saving money through reducing the space, energy, cooling and administration required by an organisations server estate by shrinking it, SMB 3.0 and scale out file server has the ability to help achieve not only this but very respectable performance at the same time. The one downside scale out file server has at present is that it cannot be used with availability groups due to the way that it presents storage to servers.

Final Words: All Abstractions Are Leaky

The concept of the leaky abstraction is attributed to stack exchange co-founder Joel Spolsky. In essence this means that try as we might to hide the gory implementation details of a piece of software or software component, these implementation still have a tendency to stick their head through the layers of abstraction that they sit under. In the context of virtualization this results in the ironic situation of having to understand the underlying hardware used in the infrastructure to a higher degree than we would were we using straight tin.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s