Every so often a post appears on LinkedIn that serves as an oasis of information in a desert of spam. The article I speak of is “Non-Volatile Storage: Implications of The Data Center’s Shifting Center”. It’s a good article for the simple fact that it covers storage class memory in both depth and breadth, not simply from a speed perspective but from a wide variety of angles.
To start with:
This change has profound effects:
1. The age-old assumption that I/O is slow and computation is fast is no longer true: this invalidates decades of design decisions that are deeply embedded in today’s systems.
2. The relative performance of layers in systems has changed by a factor of a thousand times over a very short time: this requires rapid adaptation throughout the systems software stack.
3. Piles of existing enterprise datacenter infrastructure—hardware and software—are about to become useless (or, at least, very inefficient): SCMs require rethinking the compute/storage balance and architecture from the ground up.
Because of this disparity, most databases have adopted a variety of techniques to hide latency and bridge the gap between CPU and spindle IO performance:
- Read ahead
- Write behind
- Periodic check pointing
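As a rough illustration of the write-behind idea in the list above, here is a minimal sketch in Python. It is not how any particular database engine implements it; the class name `WriteBehindBuffer`, the `flush_threshold` knob and the `sink` callable are all hypothetical names introduced purely for this example.

```python
class WriteBehindBuffer:
    """Collects small writes in memory and hands them to storage in batches."""

    def __init__(self, flush_threshold, sink):
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.sink = sink                        # callable that performs the real IO
        self.pending = []
        self.flush_count = 0

    def write(self, record):
        # The caller returns immediately; the slow IO is deferred.
        self.pending.append(record)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One large IO replaces many small ones.
        if not self.pending:
            return
        self.sink(list(self.pending))
        self.pending.clear()
        self.flush_count += 1


batches = []
buf = WriteBehindBuffer(flush_threshold=4, sink=batches.append)
for i in range(10):
    buf.write(i)
buf.flush()  # a periodic checkpoint would force out whatever is still pending
```

Ten logical writes reach the (simulated) device as three physical writes, which is the whole point when each physical write is expensive.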
The threading model and the way in which most modern operating systems handle asynchronous IO is based on the fact that it takes primary storage longer to service an IO request than it does to perform a context switch:
From a raw and naive perspective of avoidance of CPU core starvation, high performance flash does solve the IO problem.
. . . But All Is Not Well !
The article continues:
To maximize the value derived from high-cost SCMs, storage systems must consistently be able to saturate these devices. This is far from trivial: for example, moving MySQL from SATA RAID to SSDs improves performance only by a factor of five to seven14—significantly lower than the raw device differential. In a big data context, recent analyses of SSDs by Cloudera were similarly mixed: “we learned that SSDs offer considerable performance benefit for some workloads, and at worst do no harm.”4 Our own experience has been that efforts to saturate PCIe flash devices often require optimizations to existing storage subsystems, and then consume large amounts of CPU cycles. In addition to these cycles, full application stacks spend some (hopefully significant) amount of time actually working with the data that is being read and written. In order to keep expensive SCMs busy, significantly larger numbers of CPUs will therefore frequently be required to generate a sufficient I/O load.
Essentially we are turning the historical problem of systems being IO bound on its head. What the paper does not refer to is the protocol being used to access the PCIe flash. The protocol I am alluding to is Non-Volatile Memory Express, or NVMe; the following illustrates how much more efficient it is than SCSI/SATA in terms of CPU cycle usage:
Hot on the heels of this is NVMe over fabrics (NVMeF), with Remote Direct Memory Access (RDMA) over InfiniBand gaining traction in this space. The image below comes from this Mellanox blog article and it goes to show that for writes NVMeF adds 1.9us of latency and for reads it adds 4.76us of latency:
Why Handling Blocks In The Kernel Is Bad For Performance
This is the excerpt in which the article alludes to this:
Another key technique used by high-performance network stacks to significantly reduce latency is bypassing the kernel and directly manipulating packets within the application.13 Furthermore, they partition the network flows across CPU cores,1,8 allowing the core that owns a flow to perform uncontended, lock-free updates to flow TCP state.
While bypassing the kernel block layer for storage access has similar latency benefits, there is a significant difference between network and storage devices: network flows are largely independent and can be processed in parallel on multiple cores and queues, but storage requests share a common substrate and require a degree of coordination.
The transition from user to kernel mode is expensive and can cost anywhere upwards of 1500 CPU cycles; this is why Non-Volatile Memory Express handles blocks in user mode.
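To put the 1500 cycle figure into perspective, the back-of-envelope sketch below converts cycles into microseconds and works out what share of a single IO that overhead represents. The 3 GHz clock speed and the two device latencies are assumptions chosen for illustration, not measurements.

```python
def cycles_to_microseconds(cycles, clock_ghz):
    """Convert a CPU cycle count to microseconds at a given clock speed."""
    return cycles / (clock_ghz * 1000.0)


def overhead_share(transition_us, device_us):
    """Fraction of the total IO time spent on the mode transition."""
    return transition_us / (transition_us + device_us)


CLOCK_GHZ = 3.0              # assumed clock speed
SPINNING_DISK_US = 5000.0    # assumed latency of a spinning disk IO
FAST_FLASH_US = 10.0         # assumed latency of a very fast flash IO

transition_us = cycles_to_microseconds(1500, CLOCK_GHZ)
disk_share = overhead_share(transition_us, SPINNING_DISK_US)
flash_share = overhead_share(transition_us, FAST_FLASH_US)

print(f"transition cost: {transition_us:.2f}us")
print(f"share of a disk IO:  {disk_share:.4%}")
print(f"share of a flash IO: {flash_share:.4%}")
```

Under these assumptions the transition is lost in the noise for a spinning disk, but becomes a visible slice of every IO once the device itself responds in a handful of microseconds.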
The notion of I/O-centric scheduling recognizes that in a storage system, a primary task of the CPU is to drive I/O devices. Scheduling quotas are determined on the basis of IOPS performed, rather than CPU cycles consumed, so typical scheduling methods do not directly apply. For example, a common legacy scheduling policy is to encourage yielding when lightly loaded, in exchange for higher priority when busy and in danger of missing deadlines—a strategy that penalizes device polling threads that are needed to drive the system at capacity. The goal of I/O-centric scheduling must be to prioritize operations that drive device saturation while maintaining fairness and limiting interference across clients.
This basic premise underpins why asynchronous IO exists: it allows the thread that issued an IO request to do something else whilst the request is serviced, or another thread to be scheduled whilst it waits for the request to complete. The whole notion of IO subsystem performance (to date) has driven certain operating system design features, such as IO completion ports in Windows. Whilst a context switch can be completed within the time taken for an IO request to complete, this concept behind thread scheduling and asynchronous IO has stood the test of time . . . so far.
I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for requests whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports in conjunction with a pre-allocated thread pool than by creating threads at the time they receive an I/O request.
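The general asynchronous pattern can be sketched with Python’s `asyncio`. This is not the Windows completion port API; `asyncio.sleep` merely stands in for the device servicing a request, and the function names and the 10ms delay are hypothetical, picked for the example.

```python
import asyncio


async def simulated_io(events, delay_s):
    """Issue a pretend IO request and wait for it without blocking the thread."""
    events.append("io issued")
    await asyncio.sleep(delay_s)  # stands in for the device servicing the request
    events.append("io completed")
    return "page"


async def other_work(events):
    """Useful work done whilst the IO request is in flight."""
    events.append("other work done")


async def main():
    events = []
    io_task = asyncio.create_task(simulated_io(events, 0.01))
    await asyncio.sleep(0)      # yield once so the IO request actually gets issued
    await other_work(events)    # the thread is free to do something else
    result = await io_task      # pick up the completion
    return events, result


events, result = asyncio.run(main())
print(events)
```

The ordering of the recorded events shows the payoff: other work finishes between the request being issued and its completion, which only helps if the request takes long enough for there to be a gap to fill.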
The 64 million dollar question is this: what happens when one of the bedrock principles underpinning scheduling design and asynchronous IO no longer holds true, i.e. when we can guarantee that IO requests will complete within the time taken to perform a context switch on a consistent basis?
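One way to reason about that question is a simple break-even rule: giving up the CPU costs one context switch away and one back, so if the IO is expected to finish sooner than that, spinning on the device is cheaper. The sketch below encodes that rule. The 3us context switch cost and the 5000us disk latency are assumed figures, and `should_poll` is a hypothetical helper rather than anything an operating system exposes; the 1.9us value reuses the NVMeF write figure quoted earlier purely as an example of a very short wait.

```python
def should_poll(expected_io_us, context_switch_us):
    """Return True when busy-waiting is expected to cost less than yielding.

    Assumption: yielding the CPU costs two context switches, one to switch
    away from the waiting thread and one to switch back to it.
    """
    return expected_io_us < 2 * context_switch_us


CONTEXT_SWITCH_US = 3.0  # assumed cost of a single context switch

disk_decision = should_poll(5000.0, CONTEXT_SWITCH_US)   # assumed disk latency
fast_decision = should_poll(1.9, CONTEXT_SWITCH_US)      # a very short wait

print("spinning disk -> poll?", disk_decision)
print("very fast IO  -> poll?", fast_decision)
```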
JBODs conveniently abstract storage behind this controller; a client need only send requests to the head, without requiring any knowledge of the internal architecture and placement of data. A single SCM can outperform an entire JBOD, but it provides significantly lower capacity. Could a JBOD of SCMs provide high-speed and high-capacity storage to the rest of the datacenter? How would this affect connectivity, power, and CPU utilization?
There are two elements to this: the physical fabric used to connect the server to the storage, and the actual protocol used over it. Could NVMe over InfiniBand be the answer to this? For “NVMe over fabrics”, refer to this Flash Memory Summit presentation for further information; in a nutshell, this is an extension to the Non-Volatile Memory Express protocol which allows servers to access remote PCIe flash storage.
Workload-aware Storage Tiering exploits the locality of accesses in most workloads to balance performance, capacity, and cost requirements. High-speed, low-capacity storage is used to cache hot data from the lower-speed tiers, with the system actively promoting and demoting data as workloads change. Failing to implement workload-aware tiering results in high-value SCM capacity being wasted on cold data.
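The promote and demote cycle described in that excerpt can be sketched as follows. This is a deliberately simple toy, assuming access counts are the only heat signal and that rebalancing happens in discrete passes; `TieredStore`, `fast_capacity` and `rebalance` are hypothetical names and do not come from the article or from any product.

```python
from collections import Counter


class TieredStore:
    """Tracks which keys live on the small fast tier and which on the slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity  # number of keys the fast tier can hold
        self.fast = set()
        self.hits = Counter()

    def access(self, key):
        """Record an access and report which tier served it."""
        self.hits[key] += 1
        return "fast" if key in self.fast else "slow"

    def rebalance(self):
        """Promote the hottest keys, demote everything else, then reset the counts
        so that placement follows the workload as it changes."""
        hottest = [key for key, _ in self.hits.most_common(self.fast_capacity)]
        self.fast = set(hottest)
        self.hits.clear()


store = TieredStore(fast_capacity=2)
for key in ["a"] * 5 + ["b"] * 3 + ["c"]:
    store.access(key)
store.rebalance()

tier_of_a = store.access("a")
tier_of_c = store.access("c")
print(sorted(store.fast), tier_of_a, tier_of_c)
```

With room for only two keys on the fast tier, the rarely touched key stays on slow storage, so the expensive capacity is not wasted on cold data.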
The value you get from tiering, and indeed whether there is any value in using it in the first place, depends on your workload. For SQL Server there can be some simple no-brainer quick wins to be had from leveraging PCIe flash storage which require no tiering considerations; I’m thinking of transaction logs and tempdb specifically.
What is a ‘balanced system’? There are two aspects to this. One is having a balanced hardware infrastructure:
The other aspect to this is having software which can handle the fact we are no longer IO bound. In the context of the SQL Server database engine, this means having synchronisation mechanisms around the parts of the engine that perform IO that can cope with the fact that PCIe flash storage is an order of magnitude faster than spinning disk based storage.
Can we just drop SCMs into our systems instead of magnetic disks, and declare the case closed? Not really. By replacing slow disks with SCMs, we merely shift the performance bottleneck and uncover resource shortfalls elsewhere—both in hardware and in software.
I’ve already demonstrated in another blog post that it’s possible to create a test with SQL Server 2016 in which the raw speed of the flash storage causes the multiple log writer threads to fight each other, causing LOGFLUSHQ spinlock contention which creates back pressure on the log cache.
This Is Great, But What Does It Mean For Me ?
It means that we are going to see a seismic shift in where the performance bottlenecks are: instead of a DBA’s efforts being focused on IO, the focus is going to move from wait time to service time. We are not quite there yet in terms of this being the case for every IT and SQL Server shop, but I’m confident that this is a matter of ‘When’ rather than a matter of ‘If’.