Well Known and Not So Well Known SQL Server Tuning “Knobs and switches”

This blog post consolidates most of the performance ‘tweaks’ I have covered in my blog posts to date.

Many of the techniques I have used and blogged about relate to tuning SQL Server at scale, by which I mean pushing the database engine to its limits with the hardware available. For most shops running SQL Server in production, basic good practice needs to be in place first:

  • Keeping transaction log virtual log files down to a ‘Sensible’ number.
  • The correct configuration of tempdb.
  • Choosing an IO sub system that satisfies the performance requirements of your application, low latency for OLTP, high throughput for OLAP.
  • Ensuring a good page life expectancy through a combination of having enough physical memory in your server and the avoidance of rogue execution plans which are going to trash the buffer cache.
  • Ensuring the best possible execution plan “Shape” for applications going into production, e.g. queries with parallel plans, hash joins and bitmap filters for OLAP style queries.
  • Having an optimal indexing (OLTP) or partitioning strategy (OLAP) in place.
  • Leveraging the correct SQL Server feature(s) for the correct application, e.g. column stores and batch mode for data warehouses.
  • Collecting application performance baselines.
  • and so on.

Trace flags 1117 and 1118

Trace flag 1117 causes SQL Server to auto grow each data file in the same filegroup evenly. Trace flag 1118 prevents SQL Server from creating mixed allocation extents, which reduces pressure on the shared global allocation map (SGAM); it is typically used when SGAM contention takes place in tempdb. My post on concurrent inserts illustrates the difference this makes:

[graph: concurrent inserts with and without trace flags 1117/1118]

It also makes a difference to the parallel insert in SQL Server 2014:

[graph: parallel insert in SQL Server 2014]
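Both trace flags can be enabled for the running instance with DBCC TRACEON, or made permanent as startup parameters. A minimal sketch, assuming sysadmin rights:

```sql
-- Enable both trace flags globally (-1) for the running instance
DBCC TRACEON (1117, -1);
DBCC TRACEON (1118, -1);

-- Confirm which trace flags are active
DBCC TRACESTATUS (-1);
```

To make the flags survive a restart, add `-T1117 -T1118` as startup parameters via SQL Server Configuration Manager.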

Trace flag 4199

This trace flag enables optimizer hotfixes, as outlined in this Microsoft support note. My best advice is to use it from the outset for green-field projects; for applications that have been running without it, look for anomalies in execution plans that suggest an optimizer bug might be at large. At one particular client site I worked at, I noticed ‘no join predicate’ warnings in the execution plans for fact and dimension tables which were quite clearly joined in the source T-SQL. The application in question was a data warehouse running on SQL Server 2008 R2, and the queries were associated with cube builds; in addition to the strange ‘no join predicate’ warnings, there was a lack of parallelism and hash joins in the plans. Enabling this trace flag resulted in parallel execution plans, more hash joins and the few outer rows optimization being used, with cube build times dropping from around 1 hour to 1 hour 15 minutes down to 10 to 15 minutes.

The bottom line is that for systems already in production this trace flag has to be tested extensively before being put into use, because it comes with a huge regression risk; but it can also come with huge rewards.

Trace flag 2301

This trace flag enables advanced decision support optimizations; further information on it can be found in this Microsoft technical note. Trace flag 2301 comes at the expense of increased compilation and recompilation times. The sweet spot for this trace flag, as “decision support” suggests, is OLAP/reporting/MI-style query workloads. Rather than turning it on globally, my preference would be to test it on a statement-by-statement basis using QUERYTRACEON.
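As a sketch of the statement-level approach, the hint can be appended to an individual reporting query (the table and column names below are hypothetical; QUERYTRACEON requires sysadmin permissions):

```sql
-- Trace flag 2301 applies to this statement only, not the whole instance
SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS TotalSales
FROM   dbo.FactSales AS f
JOIN   dbo.DimDate   AS d ON d.DateKey = f.DateKey
GROUP  BY d.CalendarYear
OPTION (QUERYTRACEON 2301);
```

Comparing the plan and elapsed time of the statement with and without the hint gives a low-risk way to evaluate the flag before considering wider use.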

Trace flag 2330

Statistics collection for the data structures used by sys.dm_db_index_usage_stats and sys.dm_db_missing_index_group_stats is disabled by specifying this trace flag as a startup parameter. Consider using this if there is heavy spin activity on the OPT_IDX_STATS spinlock.

Trace flag 8008

This CSS Engineers blog post contains probably the best information I have come across to date on how SQL OS scheduling works. I will not reproduce the whole post here; however, the one part of it which is worth quoting verbatim is:

As always, these are not intended for extended use and should only be used under the guidance of Microsoft SQL Server support. Because these change the task assignment algorithm they can impact the concurrency and performance of your system.

What this trace flag does is change the scheduling behavior of SQL OS from using the preferred scheduler to using the scheduler with the least load:

A new connection is assigned to the scheduler with the smallest load factor within the same NUMA node.   Load factor is loosely equivalent to the number of tasks assigned to the scheduler.  You can view the scheduler information in the DMV (sys.dm_os_schedulers.)  The scheduler choice becomes the preferred (or hint) scheduler for the life of the connection.
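The load factor mentioned in the quote can be inspected directly. A quick sketch against the documented DMV:

```sql
-- Inspect per-scheduler load to spot 'hot' schedulers
SELECT scheduler_id,
       cpu_id,
       load_factor,           -- roughly the number of tasks assigned
       current_tasks_count,
       runnable_tasks_count
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE'
ORDER  BY load_factor DESC;
```

A persistently skewed load_factor across otherwise idle schedulers is the symptom this trace flag addresses.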

When testing concurrent inserts, I have found that the default SQL OS scheduler behavior can cause ‘hot’ schedulers; look at CPUs 15, 16 and 17 in this image taken from Windows Performance Analyzer:

[image: no_tf8008]

Now look at how much more even the CPU time is across CPUs with trace flag 8008 enabled:

[image: TF8008]

Large memory page support

When using Enterprise Edition on a server with 8GB of RAM or more, with the Lock Pages in Memory privilege granted to the SQL Server service account, the database engine will use large memory pages. This allows the translation look-aside buffer (TLB) in the memory management unit of the server's CPUs to map 2MB memory pages. Without large memory pages in use we have:

[image: default TLB]

With large memory pages the picture changes to:

[image: TLB with large pages]

Large memory pages result in fewer expensive off-CPU trips to main memory in order to perform page table look-ups. In a simple test, this is what the feature can do for page lookup rates:

[image: large page performance]

To quote this CSS engineers blog post, this feature is not for everyone. For example, in the event that an instance fails over in a failover cluster, failover times will be increased because 2MB chunks of contiguous memory will have to be found on the node on which the failed-over instance is starting up.
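Whether an instance is actually using large pages can be confirmed from the process memory DMV; a quick sketch:

```sql
-- Non-zero large_page_allocations_kb indicates large pages are in use
SELECT large_page_allocations_kb,
       locked_page_allocations_kb,
       physical_memory_in_use_kb
FROM   sys.dm_os_process_memory;
```

The SQL Server error log also notes at startup whether large pages are being used for the buffer pool.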

Affinitizing The Database Engine Away From The Log Writer

If you are running an OLTP workload at scale with at least 16 physical CPU cores, a significant amount of CPU time can be expended on the LOGCACHE_ACCESS spinlock. Affinitizing the database engine away from physical CPU core 0 on NUMA node 0 can help reduce spins on this spinlock. The log writer thread is usually (though not always) allocated to the first CPU core on NUMA node 0; the problem this technique solves is this:

[image: LOGCACHE_ACCESS]

To minimize LOGCACHE_ACCESS spinlock spins, the log writer needs to hand off the spinlock cache entry to the CPU core hosting the worker thread that requires it as fast as possible, and then receive it back as fast as possible. If the log writer thread is being context switched on and off core 0 because other database engine threads are running on it, this increases the time taken to hand off the LOGCACHE_ACCESS spinlock cache entry and receive it back. Removing core 0 from the CPU affinity mask (CPUs 0 and 1 if hyper-threading is enabled) helps to reduce context switching on the log writer thread.
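On a hypothetical server with 32 logical CPUs and hyper-threading enabled, removing CPUs 0 and 1 from the affinity mask can be sketched as:

```sql
-- Leave logical CPUs 0 and 1 (physical core 0) to the log writer,
-- restricting worker threads to CPUs 2 through 31
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 2 TO 31;
```

The CPU range is an assumption for illustration; adjust it to the logical CPU count of the server in question, and test carefully, since shrinking the affinity mask reduces the CPUs available to the rest of the engine.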

In a simple test I performed, this technique helps to reduce spins on the LOGCACHE_ACCESS spinlock by an order of magnitude:

[image: results]

Log Writer Worker Thread CPU Socket Collocation

In addition to reducing log writer context switches, the distance the LOGCACHE_ACCESS spinlock cache entry has to travel to the worker thread wishing to acquire it has an impact on the number of spins it undergoes:

[image: Numa2Numa]

The CPU cycle overhead of acquiring and releasing the spinlock is much lower when the log writer and worker thread are collocated:

[image: core2core]

In my most recent blog post, under a workload consisting of 10 threads concurrently performing inserts, this is the difference it makes to LOGCACHE_ACCESS spinlock spin activity with the conventional SQL Server engine:

[graph: test 1]

And this is the difference it makes in a similar test using Hekaton:

[graph: test 2]

Delayed durability

In the blog post on scaling singleton inserts, the use of delayed durability eliminated WRITELOG waits and roughly halved the elapsed time for the concurrent insert workload:

[image: delayed durability]
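Delayed durability can be enabled database-wide or opted into per transaction; a minimal sketch, with a hypothetical database and table name (bear in mind it trades a window of potential data loss for reduced log flush waits):

```sql
-- Permit delayed durable commits at the database level
ALTER DATABASE MyOltpDb SET DELAYED_DURABILITY = ALLOWED;

-- Opt in on an individual commit
BEGIN TRANSACTION;
INSERT INTO dbo.MyTable (Col1) VALUES (1);
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```

Setting the database option to FORCED instead of ALLOWED makes every commit delayed durable without changing application code.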

Pre-Sorting Heap Data Prior To Column Store Index Creation

In most data warehouses, certain columns used to join fact tables to dimension tables will be hotter than others; those used to facilitate joins with date dimension tables spring to mind. If you can establish which columns these are, a conventional clustered index can be created on the heap first, and then a column store index can be created on the clustered index, provided that:

  • The column store index has the same name as the clustered index.
  • The WITH DROP_EXISTING clause is specified.
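As a sketch, assuming a hypothetical dbo.FactSales heap whose hot join column is DateKey:

```sql
-- 1. Sort the heap on the hot join column with a conventional clustered index
CREATE CLUSTERED INDEX ccsi_FactSales
    ON dbo.FactSales (DateKey);

-- 2. Replace it with a clustered column store index of the same name;
--    the column store is built from the pre-sorted rows
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_FactSales
    ON dbo.FactSales
    WITH (DROP_EXISTING = ON);
```

The index and table names here are illustrative; the key points are the shared name and the DROP_EXISTING clause.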

This turns the hash probes on the hash aggregate or hash join following the column store scan from random memory accesses into sequential memory accesses, with dramatic results:

[graph: scaling]

Sequential memory access is a pattern which is easy for the CPU memory pre-fetcher to understand and it results in fewer trips off the CPU out to main memory:

[image: LLCMiss2]

So far I have tested this using a warm column store object pool; however, I suspect the same result can be reproduced if the column store is scanned from a high-performance NAND flash PCIe card.

5 thoughts on “Well Known and Not So Well Known SQL Server Tuning ‘Knobs and switches’”

  1. Hey Chris, great stuff as always! 🙂

    Some additions as per our conversation on Twitter:
    – Large Pages are not supported for the Columnstore Indexes
    – Prefetch Disabling Trace Flag 652 is not supported for Columnstore Indexes as well

    This is a SQL Server 2014 CU6 Stand

    • I’ve thought about this and come to the conclusion that large memory pages should not have much relevance to the batch engine. The batch engine is designed to leverage the CPU prefetcher such that when it requires a batch, the batch is already located in the CPU cache hierarchy. Large pages work by increasing the amount of memory that can be covered by the translation look-aside buffer (TLB) in the CPU memory management unit; this minimizes the penalty of the CPU having to go to main memory in order to look up logical-to-physical memory mapping information in the page tables. The TLB caches some of this information, up to 32MB worth depending on which generation of Xeon processor you are talking about. In fact, if you were to bypass main memory, which you can do with Sandy Bridge micro-architecture based Xeons (and onward), you can pump data into the CPU's level 3 cache; with enough PCIe lanes and cards this is faster than using main memory. If you Google “CPU memory wall” you will find lots of good articles on how the speed of main memory has lagged behind the pace of CPU development. There are lots of discussions about Moore's law running out of steam when the CPU fabrication process hits 10nm; we will probably hit the CPU memory wall before then. DDR4 memory will help, but High Bandwidth Memory / Hybrid Memory Cubes are probably the long-term solution.

      Regarding trace flag 652, I can see why this is not supported for column stores. My ioDrive 2 Duo achieves 3GB/s with concurrent 64K reads but is killed when I force the read IO size down to 8K.
