By “Core i Xeon” I am referring to the generation of Xeons ushered in by the “Core i” series micro-architecture, the first of which was Nehalem – Westmere being the same as Nehalem but with a die shrink to 32nm:
Xeon E5 1st Generation: Nehalem and Westmere
Nehalem provided the framework for the first truly modular Intel processor. Prior to this, the Penryn-based CPUs were scaled out by bolting dual-core units together, and both the memory controller and the IO hub resided in the supporting chipset, in the parts colloquially known as the “north bridge” and “south bridge” respectively:
Nehalem introduced a truly modular design, a new point-to-point processor interconnect in the form of the QuickPath Interconnect (QPI), a last level cache shared between all cores, and DDR3 memory support:
However, the biggest boon for SQL Server workloads was the re-introduction of hyper-threading, re-architected such that CPU stalls on last level cache misses ( the idle cycles whilst data is retrieved from main memory ) are used to accommodate a second thread per physical core. This is known as “Simultaneous Multi-Threading” (SMT), as opposed to the Clustered Multi-Threading (CMT) used by AMD:
The sweet spot for hyper-threading is OLTP workloads; it comes into play when a buffer pool is subject to concurrent and random access.
Takeaway #1 E5 Generation 1 – a massive boost is given to throughput for OLTP style workloads via the re-architected version of hyper-threading.
Xeon E5 2nd Generation: Sandy Bridge and Ivy Bridge
Sandy Bridge was the ‘Tock’ in the cycle that followed Westmere; highlights of the Sandy Bridge micro-architecture included:
- A bi-directional ring bus to connect the L3 (last level) cache to the CPU cores.
- A quad channel memory controller; the Nehalem and Westmere micro-architectures had triple channel controllers.
- The introduction of “Advanced Vector Extensions” – AVX for short – which widened SIMD floating point operations to 256-bit registers; 256-bit integer SIMD did not arrive until AVX2.
- A vastly improved “Front end” – the part of the processor that does memory prefetching, instruction decoding, branch prediction and scheduling. In the diagram below, everything to the left of and including the ‘Decoder’ is the “Front end”; the rest, to the right of it, is the “Back end”.
However, to quote the Real World Tech site, from this article:
Without a doubt, the biggest change in Sandy Bridge-EP is the 40 lanes of integrated PCI-Express 3.0. In all previous server platforms, the discrete I/O Hub was connected to the processor through QPI, wasting coherency bandwidth. This also meant that many systems did not necessarily scale the I/O performance up or down with the number of processor sockets. In aggregate, Sandy Bridge has 80GB/s for I/O and 80GB/s for inter-processor communication over QPI 1.1; in contrast Westmere-EP shares 64GB/s of QPI bandwidth for both I/O and coherency. The tighter physical integration in Sandy Bridge-EP reduces I/O latency by around 15-30%, compared to the prior generation. The I/O controller has a more advanced APIC and can configure one of the 16 lane PCI-E ports for non-transparent bridging.
Takeaway #2 E5 Generation 2 – IO latency and throughput are improved over generation 1 by integrating the IO hub onto the processor die.
. . . this is kind of important because low latency IO and high IO throughput tend to be beneficial to most database engines ! 😉
Sandy Bridge also introduced “Data Direct IO” (DDIO): inbound PCIe data can be placed directly in the processor’s last level cache without a trip through main memory first:
Data Direct IO is used by some host bus adapters and network cards; it results in lower power consumption and increased throughput:
Xeon E5 3rd Generation: Haswell
As far as I know, the latest Xeon E5 generation is based on the Haswell micro-architecture. The increase in the number of transistors used by a Haswell CPU is substantial compared to the number used by its predecessor; to quote an excerpt from this article on the ExtremeTech web site:
It’s also very important to remember that Haswell is a much more advanced chip than Ivy Bridge. In terms of transistor counts, disregarding any changes to the GPU, a quad-core Haswell CPU has roughly 200 million more transistors than a quad-core Ivy Bridge CPU (1.4 billion vs. 1.2 billion). A lot of these transistors were spent on increasing the chip’s IPC (instructions-per-clock) by adding more execution resources and beefing up out-of-order execution (OoOE) capabilities, but most of them are dedicated to brand new features such as AVX2, FMA3, and TSX. These are very powerful features, but for the most part they only boost performance when software has been specifically written/compiled to take advantage of them. Today, we are merely seeing the ~10% speed-up provided by Haswell’s reworked execution core; tomorrow, when software uses AVX2, FMA3, and TSX, the speed-up could be 25% or more.
Haswell supports DDR4 memory, which in theory allows servers to be built containing 12TB of RAM. Note AVX2: this is the version in which AVX “comes of age” with regard to being useful to database engine designers and programmers, in that SIMD integer operations are now supported on the full 256-bit registers; SQL Server 2016 uses this for batch decompression when performing column store scans. Haswell can also dispatch up to eight micro-operations per clock cycle across eight execution ports, up from six ports on the previous generations of the E5 architecture. SAP has managed to take advantage of another new feature to appear in the 3rd generation of the E5, Transactional Synchronization Extensions (TSX):
The new Haswell processor architecture also includes the Intel Transactional Synchronization Extensions (TSX) capability which provides hardware-supported lock elision for improved transactional data processing. This innovative technology boosts the performance of in-memory transactional data processing on systems with high core counts by increasing the scalability of thread synchronization.
On one hand, the increasing number of cores that comes with every new generation of processors enhances the performance of large in-memory database systems, but it also poses greater challenges to programmers who need to develop new ways to improve synchronization performance. Writing sophisticated algorithms to run multiple queries in parallel, while at the same time maintaining a synchronized access to internal data structures is not a trivial effort. The Intel TSX capability alleviates this problem by leveraging hardware to dynamically determine whether threads actually need to synchronize even in the presence of conservatively used synchronization. SAP HANA is leveraging the Intel TSX capability to improve its existing lock-based programming model, resulting in faster system performance and extended scalability.
Takeaway #3 E5 Generation 3 – how the E5 v3 helps SQL Server:
- Servers can now contain up to 12TB of RAM.
- Anything that requires high memory bandwidth instantly benefits from DDR4 memory => think tabular model scans.
- Faster column store scans with SQL Server 2016, by the database engine leveraging AVX2 instructions when decompressing batches.
- More raw grunt at the back end of the CPU: eight execution ports, up from six, allowing up to eight micro-operations to be dispatched per clock cycle.
So Chris When Are You Going To Test All Of These ?
This is difficult for several good reasons: it requires both money and space to accommodate all of this hardware, and I get grief from my wife over the contents of my “Man cave”; to quote Helen:
Why can’t you have just one smaller computer like most normal people ?
I do have machines with Gen 1 and 2 E5 Xeons; however, they are clocked at different speeds, making it difficult for me to perform like-for-like tests. Despite the issues and practicalities of performing such tests, I hope to have a new machine with 2 x 10 core Haswell E5 v3s up and running in the not too distant future. My hope is that I can observe some pressure on certain spinlocks I have been unable to create to date with my current hardware.