This blog post consolidates most of the performance ‘Tweaks’ I have used in my blog posts to date.
Many of the techniques I have used and blogged about relate to tuning SQL Server at scale; by this I mean testing the database engine to its limits with the hardware available. For most ‘Shops’ running SQL Server in production, basic good practice needs to be in place first:
- Keeping transaction log virtual log files down to a ‘Sensible’ number.
- The correct configuration of tempdb.
- Choosing an I/O subsystem that satisfies the performance requirements of your application: low latency for OLTP, high throughput for OLAP.
- Ensuring a good page life expectancy through a combination of having enough physical memory in your server and avoiding rogue execution plans which will trash the buffer cache.
- Ensuring the best possible execution plan “Shape” for applications going into production, e.g. queries with parallel plans, hash joins and bitmap filters for OLAP style queries.
- Having an optimal indexing (OLTP) or partitioning strategy (OLAP) in place.
- Leveraging the correct SQL Server feature(s) for the correct application, e.g. column stores and batch mode for data warehouses.
- Collecting application performance baselines.
- etc etc etc
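As a sketch of how a couple of these basics can be checked, the queries below report virtual log file counts per database and the current page life expectancy. This assumes SQL Server 2016 SP2 or later for sys.dm_db_log_info; on older versions DBCC LOGINFO serves the same purpose:

```sql
-- Virtual log file count per database; a database with thousands of
-- VLFs is a candidate for a log rebuild with sensible growth increments.
SELECT d.name,
       COUNT(*) AS vlf_count
FROM   sys.databases AS d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
GROUP  BY d.name
ORDER  BY vlf_count DESC;

-- Current page life expectancy in seconds, as reported by the
-- buffer manager performance counters.
SELECT [object_name], counter_name, cntr_value AS ple_seconds
FROM   sys.dm_os_performance_counters
WHERE  counter_name LIKE 'Page life expectancy%';
```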
Trace flags 1117 and 1118
Trace flag 1117 causes SQL Server to auto grow each data file in the same filegroup evenly. Trace flag 1118 prevents SQL Server from creating mixed allocation extents, which reduces pressure on the shared global allocation map (SGAM); it is typically used when SGAM contention takes place on tempdb. My post on concurrent inserts illustrates the difference it makes:
It also makes a difference to the parallel insert in SQL Server 2014:
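To illustrate, both trace flags can be enabled globally at runtime, though for a permanent setting they are more commonly added as -T1117 and -T1118 startup parameters:

```sql
-- Enable both trace flags globally for the lifetime of the instance;
-- -1 makes the setting apply to all sessions, not just the current one.
DBCC TRACEON (1117, -1);
DBCC TRACEON (1118, -1);

-- Confirm which trace flags are currently active.
DBCC TRACESTATUS (-1);
```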
Trace flag 4199
This trace flag enables optimizer hot fixes as outlined in this Microsoft support note. My best advice is to use it from the outset for green field projects; for applications that have been running without it, look for anomalies in execution plans that suggest an optimizer bug might be at large. At one particular client site I worked at, I noticed ‘no join predicate’ warnings in execution plans for fact and dimension tables which were quite clearly joined in the source T-SQL. The application in question was a data warehouse running on SQL Server 2008 R2 and the queries in question were associated with cube builds; in addition to the strange ‘no join predicate’ warnings, there was a lack of parallelism and hash joins in the plans. Enabling this trace flag resulted in parallel execution plans, more hash joins, the few outer rows optimization being used, and cube build times dropping from around 1 hour to 1 hour 15 minutes down to 10 to 15 minutes.
The bottom line is that for systems already in production, this trace flag has to be tested extensively before being put into use because it comes with a huge regression risk, but it can also come with huge rewards.
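Where enabling the flag globally is too risky, the QUERYTRACEON hint scopes the optimizer hot fixes to a single statement. The fact and dimension table names below are hypothetical:

```sql
-- Hypothetical fact/dimension join; the hot fixes enabled by trace
-- flag 4199 apply only to the optimization of this one statement,
-- which limits the regression risk while testing.
SELECT d.calendar_year,
       SUM(f.sales_amount) AS total_sales
FROM   dbo.fact_sales AS f
JOIN   dbo.dim_date   AS d
       ON f.date_key = d.date_key
GROUP  BY d.calendar_year
OPTION (QUERYTRACEON 4199);
```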
Trace flag 2301
This trace flag enables enhanced decision support optimizations; further information on it can be found in this Microsoft technical note. Trace flag 2301 comes at the expense of increased compilation and recompilation times. The sweet spot for this trace flag, as “Decision support” suggests, is OLAP/reporting/MI style query workloads. Rather than turn it on globally, my preference would be to test it on a statement by statement basis using QUERYTRACEON.
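A statement-scoped test might look like the following; the reporting query and table names are hypothetical, and the extra compilation cost is paid only by this one statement:

```sql
-- Hypothetical OLAP-style aggregate; trace flag 2301 is scoped to this
-- statement via QUERYTRACEON so the enhanced decision support
-- optimizations can be evaluated in isolation.
SELECT p.category,
       AVG(f.sales_amount) AS avg_sale
FROM   dbo.fact_sales  AS f
JOIN   dbo.dim_product AS p
       ON f.product_key = p.product_key
GROUP  BY p.category
OPTION (QUERYTRACEON 2301);
```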
Trace flag 2330
Specifying this trace flag as a startup parameter disables statistics collection for the data structures which sys.dm_db_index_usage_stats and sys.dm_db_missing_index_group_stats use. Consider using it if there is heavy spin activity on the OPT_IDX_STATS spinlock.
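Before trading away index usage statistics, it is worth confirming the spinlock is actually hot; a quick check against the spinlock stats DMV:

```sql
-- Look at spin activity on the OPT_IDX_STATS spinlock; high spins and
-- backoffs relative to other spinlocks suggest -T2330 may be worthwhile.
SELECT name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM   sys.dm_os_spinlock_stats
WHERE  name = 'OPT_IDX_STATS';
```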
Trace flag 8008
This CSS Engineers blog post contains probably the best information I have come across to date on how SQL OS scheduling works. I will not reproduce the blog post here; however, the one part of it which is worth quoting verbatim is:
As always, these are not intended for extended use and should only be used under the guidance of Microsoft SQL Server support. Because these change the task assignment algorithm they can impact the concurrency and performance of your system.
What this trace flag does is change the scheduling behavior of SQL OS from using the preferred scheduler to using the scheduler with the least load:
A new connection is assigned to the scheduler with the smallest load factor within the same NUMA node. Load factor is loosely equivalent to the number of tasks assigned to the scheduler. You can view the scheduler information in the DMV (sys.dm_os_schedulers.) The scheduler choice becomes the preferred (or hint) scheduler for the life of the connection.
When testing concurrent inserts, I have found that the default SQL OS scheduler behavior causes ‘Hot’ schedulers; look at CPUs 15, 16 and 17 in the image taken from Windows Performance Analyzer:
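One way to look for this symptom from inside the engine is to compare load factors across schedulers; a heavily skewed distribution on a busy system is what trace flag 8008 is aimed at:

```sql
-- Compare load across schedulers; load_factor is loosely the number of
-- tasks assigned to each scheduler, per the CSS Engineers post quoted above.
SELECT scheduler_id,
       parent_node_id,
       load_factor,
       current_tasks_count,
       runnable_tasks_count
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE'
ORDER  BY load_factor DESC;
```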
Large memory page support
When using enterprise edition on a server with 8GB of RAM or more and with the Lock Pages in Memory privilege granted to the SQL Server service account, the database engine will use large memory pages. This allows the translation look-aside buffer (TLB) in the memory management unit of the server’s CPUs to use 2MB memory pages; without large memory pages in use we have:
To quote this CSS Engineers blog post, this feature is not for everyone. For example, in the event that an instance fails over in a failover cluster, failover times will be increased because 2MB chunks of contiguous memory have to be found on the node on which the failed-over instance is being started up.
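Whether large pages are actually in use can be verified from the DMVs; note that sql_memory_model_desc is only available from SQL Server 2016 SP1 onwards, and on earlier versions the “Using large pages” messages in the error log serve the same purpose:

```sql
-- Amount of memory allocated via large pages (0 if they are not in use).
SELECT large_page_allocations_kb
FROM   sys.dm_os_process_memory;

-- Reports LARGE_PAGES when the large page memory model is active
-- (SQL Server 2016 SP1 and later).
SELECT sql_memory_model_desc
FROM   sys.dm_os_sys_info;
```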
Affinitizing The Database Engine Away From The Log Writer
If you are running an OLTP workload at scale with at least 16 physical CPU cores, a significant amount of CPU time can be expended on the LOGCACHE_ACCESS spinlock. Affinitizing the database engine away from physical CPU core 0 on NUMA node 0 can help reduce spins on this spinlock. The log writer thread is usually (but not always) allocated to the first CPU core on NUMA node 0; the problem this tip solves is this:
To minimize LOGCACHE_ACCESS spinlock spins, the log writer needs to hand off the spinlock cache entry to the CPU core hosting the worker thread that requires it as fast as possible and then receive it back as fast as possible. If the log writer thread is being context switched in and out of core 0 due to other database engine threads running on it, this will increase the amount of time taken to hand off the LOGCACHE_ACCESS spinlock cache entry and receive it back. Removing core 0 from the CPU affinity mask (CPU 0 and 1 if hyper-threading is enabled) helps to reduce context switching on the log writer thread.
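As a sketch, for a hypothetical server with 32 logical CPUs (physical core 0 backing logical CPUs 0 and 1 under hyper-threading), the affinity mask can be set to leave the first core to the log writer:

```sql
-- Affinitize the database engine to logical CPUs 2 through 31, leaving
-- logical CPUs 0 and 1 (physical core 0) free of other engine threads
-- so the log writer is not context switched off its core.
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 2 TO 31;
```

The CPU range here is an assumption for illustration; the right range depends on your core count and whether hyper-threading is enabled.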
In a simple test I performed, this technique helps to reduce spins on the LOGCACHE_ACCESS spinlock by an order of magnitude:
In addition to reducing log writer context switches, the distance the LOGCACHE_ACCESS spinlock cache entry has to travel to the worker thread wishing to acquire it has an impact on the number of spins it undergoes:
In my most recent blog post, under a workload consisting of 10 threads concurrently performing inserts, this is the difference it makes to LOGCACHE_ACCESS spinlock spin activity with the conventional SQL Server engine:
and this is the difference it makes with a similar test using Hekaton:
In the blog post on scaling singleton inserts, the use of delayed durability eliminated WRITELOG waits and improved the elapsed time for the concurrent insert workload by 100%:
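Delayed durability (available from SQL Server 2014) trades a small window of durability for log write latency; the database name below is hypothetical:

```sql
-- FORCED makes every transaction in the database delayed durable;
-- ALLOWED leaves the choice to individual commits instead.
ALTER DATABASE MyOltpDb
SET DELAYED_DURABILITY = FORCED;

-- Per-transaction alternative when the database is set to ALLOWED:
-- COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```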
Pre-Sorting Heap Data Prior To Column Store Index Creation
In most data warehouses, certain columns used to join fact tables to dimension tables will be hotter than others; those used to facilitate joins with date dimension tables spring to mind. If you can establish what these columns are, a conventional clustered index can be created on the heap first, then a column store index can be created on the clustered index if:
- a) It has the same name as the clustered index
- b) The WITH DROP EXISTING clause is specified
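The two conditions above can be sketched as follows, assuming a hypothetical fact table that is mostly joined on its date key:

```sql
-- Pre-sort the heap by the hot join column via a conventional
-- clustered index.
CREATE CLUSTERED INDEX ccsi_fact_sales
ON dbo.fact_sales (date_key);

-- Replace the b-tree with a clustered column store index: same index
-- name, and DROP_EXISTING specified, so the segments inherit the
-- sort order of the clustered index.
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_fact_sales
ON dbo.fact_sales
WITH (DROP_EXISTING = ON);
```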
This turns the hash probes on the hash aggregate or join following the column store scan from random memory accesses to sequential memory accesses with dramatic results:
Sequential memory access is a pattern which is easy for the CPU memory pre-fetcher to understand and it results in fewer trips off the CPU out to main memory:
So far I have tested this using a warm column store object pool; however, I suspect that this can be reproduced if the column store is scanned using a high performance NAND flash PCIe card.