Strategies to Improve SQL Server Storage Performance



Problem

Database systems face many obstacles, and performance is often one of the key metrics used to determine whether a database application is meeting its goals. Several things can be done to improve performance, from tuning code to implementing better physical resources such as memory, computing power, and storage. One of the primary bottlenecks for database systems is a lack of sufficient storage throughput; because database applications rely heavily on storage, a poorly implemented storage strategy can quickly become a problem. In this article, we look at several areas where storage makes a difference for SQL Server databases, along with a solution to help remove these storage bottlenecks.

Solution

SQL Server and Storage

Relational databases like SQL Server are what is known as stateful systems. Unlike front-end systems, such as web or application servers, which don't store any data, database servers rely heavily on both memory and disk to persist and protect their data. A relational database engine is the heart of most applications, which means storage performance directly affects the performance of your application. For example, a transaction can't be completed until it is persisted to the transaction log on disk, so the disk is the ultimate limiter on the performance of the entire application. While SQL Server always reads and writes data through its buffer pool in memory, data pages must first be read from disk into memory. Depending on the storage system, this can be fast or slow, and it can have a significant performance impact on the database and application.

Having sufficient memory in your server is also critical. A buffer pool large enough to hold your working set of data reduces the pressure on your underlying storage, and enough memory to avoid expensive spills to disk during sort and join operations prevents overuse of tempdb, which also relies on storage. For systems that need consistent, low latency, the performance of your storage is therefore critical.
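
One quick way to gauge buffer pool pressure from within SQL Server is the Page life expectancy counter exposed by the documented sys.dm_os_performance_counters view. This is a minimal sketch, not a complete memory health check; sustained low values under load suggest the working set no longer fits in memory and reads are being pushed back to storage:

-- Page life expectancy: the number of seconds a page is expected to
-- stay in the buffer pool. Sharp, sustained drops under load indicate
-- buffer pool pressure that translates into extra storage reads.
SELECT [object_name], counter_name, cntr_value AS seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND [object_name] LIKE N'%Buffer Manager%';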

There are a few different options for storage. Some systems use direct-attached storage (DAS), which can offer extremely low latency but is limited in flexibility: you can only have as much storage as the server can hold. DAS also offers limited support for virtualization, outside of hyperconverged infrastructure (HCI), which adds its own layer of overhead and cost. The most common infrastructure for databases is a storage area network (SAN), typically connected to the server by fiber-optic cable, which offers low-latency connections and the flexibility to easily add storage to the server without affecting its availability.

The ultimate solution for SQL Server storage and performance is a Pure Storage FlashArray. Beyond best-in-class performance, flash array storage offers added benefits like data reduction and snapshots that enhance productivity and provide more value. These arrays use all solid-state storage and deliver extremely high levels of both IOPS and throughput to support all manner of workloads, while offering a simple management interface that does not require elaborate configuration.

Managing Very Large Databases

The trend of big data is not limited to open-source systems like Hadoop; SQL Server databases have been steadily growing in size, and the need to store more and more data continues for all types of organizations. It is now very common to encounter SQL Server databases over 10 TB in size, and even some as large as 50 TB. To meet the IO requirements of these very large databases (VLDBs) on legacy hard disk drive (HDD) based storage systems, storage administrators and DBAs had to come up with creative approaches to storage design, like placing specific data files on specific LUNs and allocating more disk spindles than the data size required in order to meet the performance needs of the database. In some cases, DBAs create separate filegroups for indexes or for specific large tables. Another popular approach is using a local solid-state drive (SSD) for busy tempdb workloads. In many cases, storage arrays run out of IOPS before they run out of storage capacity.

Modern storage systems, like a Pure Storage FlashArray, deliver the needed storage performance without requiring elaborate file layouts and workarounds. Even with flash array storage, there are still things you should do to get the best performance, like not putting all of your files onto a single volume, especially in a virtual server. The most common approach is to split data files, log files, and tempdb onto their own volumes, as sketched below. This prevents queuing at the OS level, where Windows maintains an IO queue for each volume (commonly a drive letter). Beyond the operating system, if you are using a VMware virtual machine to host your SQL Server, you should assign each of those storage volumes to its own virtual SCSI controller.
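
As a minimal sketch of that layout, the following T-SQL places data and log files on separate volumes and relocates tempdb. The drive letters (D:, L:, T:), the database name, and the file sizes are placeholders; tempdev and templog are the default tempdb logical file names:

-- Data and log files on separate volumes; names, paths, and sizes are
-- illustrative only.
CREATE DATABASE [YourDatabase]
ON PRIMARY
   (NAME = N'YourDatabase_data', FILENAME = N'D:\Data\YourDatabase.mdf', SIZE = 10GB)
LOG ON
   (NAME = N'YourDatabase_log', FILENAME = N'L:\Log\YourDatabase_log.ldf', SIZE = 2GB);

-- Move tempdb to its own volume; the change takes effect after the
-- next SQL Server service restart.
ALTER DATABASE tempdb MODIFY FILE (NAME = N'tempdev', FILENAME = N'T:\TempDB\tempdev.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = N'templog', FILENAME = N'T:\TempDB\templog.ldf');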

Managing and Supporting Lower Application Environments

While your storage array has the important task of storing and maintaining your data, modern storage arrays should offer services well beyond data storage alone. One major time-consuming task DBAs face is moving data to lower application environments like development, testing, and QA. While these processes are frequently automated, the longest part is usually the backup and restore process, which can take hours.

Pure Storage FlashArray offers snapshots that allow you to instantly clone databases into lower environments, which also lets developers work with the most up-to-date data. Because the array can take backups based on snapshots, you can offload CHECKDB activity onto a snapshot of your production database. Starting with SQL Server 2022, you can also take advantage of snapshots to seed availability groups, avoiding the lengthy backup and restore process traditionally used to initialize a replica. And while you can use snapshots as the core of your backup and restore strategy, you can still include traditional backups to disk as part of a mixed strategy.
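
For context, SQL Server 2022 adds T-SQL support for coordinating array snapshots: the database is briefly suspended for writes, the array takes the snapshot, and a metadata-only backup records it for later restores. A minimal sketch of that sequence follows; the database name and backup path are placeholders, and the array-side snapshot step happens outside of T-SQL:

-- Suspend write IO so the array snapshot is application-consistent.
ALTER DATABASE [YourDatabase] SET SUSPEND_FOR_SNAPSHOT_BACKUP = ON;

-- (Take the snapshot on the storage array here.)

-- Record the snapshot in a metadata-only backup file; completing this
-- backup automatically resumes the suspended database.
BACKUP DATABASE [YourDatabase]
TO DISK = N'F:\Backup\YourDatabase.bkm'
WITH METADATA_ONLY;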

Measuring IO in SQL Server

A common scenario is a database administrator running to the storage team after observing high latency in SQL Server, complaining that the "SAN is slow". When you are presented with such a scenario, you need to step back and examine the situation holistically. SQL Server presents storage performance information in a couple of ways. In the worst case, you may see the dreaded "SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file" message in your error log. Alternatively, you can query the dynamic management function sys.dm_io_virtual_file_stats, which measures latency for each data and log file on your server. There are a couple of caveats to that DMF: the data is cumulative from the time the server started (so a recently developed performance issue may be diluted), and queries against the DMF typically report an average latency, like the query shown below from Glenn Berry:

-- Average IO latency per database file, cumulative since the last
-- SQL Server restart (query by Glenn Berry).
SELECT 
   DB_NAME(fs.database_id) AS [Database Name], 
   CAST(fs.io_stall_read_ms/(1.0 + fs.num_of_reads) AS NUMERIC(10,1)) AS [avg_read_latency_ms],
   CAST(fs.io_stall_write_ms/(1.0 + fs.num_of_writes) AS NUMERIC(10,1)) AS [avg_write_latency_ms],
   CAST((fs.io_stall_read_ms + fs.io_stall_write_ms)/(1.0 + fs.num_of_reads + fs.num_of_writes) AS NUMERIC(10,1)) AS [avg_io_latency_ms],
   CONVERT(DECIMAL(18,2), mf.size/128.0) AS [File Size (MB)], 
   mf.physical_name, 
   mf.type_desc, 
   fs.io_stall_read_ms, 
   fs.num_of_reads, 
   fs.io_stall_write_ms, 
   fs.num_of_writes, 
   fs.io_stall_read_ms + fs.io_stall_write_ms AS [io_stalls], 
   fs.num_of_reads + fs.num_of_writes AS [total_io],
   io_stall_queued_read_ms AS [Resource Governor Total Read IO Latency (ms)], 
   io_stall_queued_write_ms AS [Resource Governor Total Write IO Latency (ms)] 
FROM sys.dm_io_virtual_file_stats(null,null) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK) 
   ON fs.database_id = mf.database_id AND fs.[file_id] = mf.[file_id]
ORDER BY avg_io_latency_ms DESC OPTION (RECOMPILE);

While the average is useful, it can be skewed. For example, if an index rebuild process runs overnight and saturates your storage array without impacting user workloads, that slowdown will still be reflected in the average. This query is a good first step in examining storage performance: if your numbers are low (under 10 ms is good; under 3 ms is great), your storage is performing adequately. If your numbers are bad, you want to take a deeper look at your storage.
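
One way around the since-startup averaging is to capture the DMF twice and diff the counters, so the computed latency reflects only the sampled interval. This is a minimal sketch; the one-minute delay is an arbitrary choice for illustration:

-- Capture a baseline of the cumulative counters.
SELECT database_id, [file_id], num_of_reads, io_stall_read_ms,
       num_of_writes, io_stall_write_ms
INTO #baseline
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

-- Wait for the sampling interval (illustrative: one minute).
WAITFOR DELAY '00:01:00';

-- Diff against the baseline to get latency for just this interval.
SELECT DB_NAME(fs.database_id) AS [Database Name], fs.[file_id],
       CAST((fs.io_stall_read_ms - b.io_stall_read_ms) /
            (1.0 + fs.num_of_reads - b.num_of_reads) AS NUMERIC(10,1)) AS [interval_read_latency_ms],
       CAST((fs.io_stall_write_ms - b.io_stall_write_ms) /
            (1.0 + fs.num_of_writes - b.num_of_writes) AS NUMERIC(10,1)) AS [interval_write_latency_ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN #baseline AS b
   ON fs.database_id = b.database_id AND fs.[file_id] = b.[file_id]
ORDER BY [interval_read_latency_ms] DESC;

DROP TABLE #baseline;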

Typically, the next troubleshooting step for a DBA, if you aren't already doing it, is to monitor storage performance at the operating system layer. In Windows, this is done using Performance Monitor, more commonly known by its executable name, perfmon. There are several storage-related counters, but the important ones to monitor are:

  • Avg. Disk sec/Read
  • Avg. Disk sec/Write
  • Disk Reads/sec
  • Disk Writes/sec

These counters can be captured as a total across all volumes or for each volume individually. Typically, you would configure them per volume so you can isolate any volume-specific performance issues. Capturing this perfmon data over time gives you a better understanding of the IO profile of your SQL Server. You also want to correlate this data with your SQL Server workloads; a poor query that is missing indexes can make a small storage problem much worse. If, after those examinations, you are still seeing consistently slow storage performance, you can work with your storage administrator to understand what bottleneck you are hitting.
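
To know which volumes to watch, you can map database files to their OS volumes from within SQL Server. This small helper query uses the documented sys.dm_os_volume_stats function:

-- List each volume that actually hosts SQL Server files, so perfmon
-- counters can be captured on the right volumes.
SELECT DISTINCT vs.volume_mount_point,
       vs.logical_volume_name,
       vs.total_bytes / 1048576 AS total_mb,
       vs.available_bytes / 1048576 AS available_mb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.[file_id]) AS vs;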

For a better storage management experience, Pure Storage FlashArray includes the Pure1 platform, which allows your storage administrators to quickly understand and analyze the performance of the array. Pure1 runs as a cloud service, which lets you monitor from anywhere. In addition to performance data, it provides insights to help you better manage your storage assets, save money, and ensure the right resources are deployed.

Backups, Snapshots and Consistency Checks

There are a couple of other concerns around managing very large databases, namely backups and consistency checks. Backups are a critical part of maintaining databases, whether to protect against disasters like a data center emergency or against accidental deletion of data. There are a handful of tools for making large database backups go more quickly. The first is to use compression as part of the backup command: while there is a small amount of CPU overhead associated with compression, backup sizes are reduced by around 30% on average. SQL Server 2022 introduces support for Intel QuickAssist Technology (QAT), which allows for even more efficient backup compression.

Another common best practice for backup performance is to stripe the backup across multiple files. You can stripe to as many files as you wish, but a good baseline is half the number of CPU cores in your server or VM. Striping helps both backup and restore performance, as multiple threads are used to perform the operation. There are other configuration options, like adjusting the number of memory buffers and the maximum transfer size, but the largest gains typically come from striping and compression. Both backup and restore performance are limited by the network bandwidth and storage IOPS available for the operation. Backup writes are issued in one-megabyte (MB) blocks, which can quickly saturate a network connection with limited throughput. Likewise, the storage array needs to keep up with sequential writes during both backup and restore processes; it's not simply a matter of network, it's network and storage.
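
A minimal sketch of these options follows; the database name, backup paths, and tuning values are placeholders (four stripes assumes roughly an eight-core server):

-- Striped, compressed backup; paths and settings are illustrative.
BACKUP DATABASE [YourDatabase]
TO DISK = N'F:\Backup\YourDatabase_1.bak',
   DISK = N'F:\Backup\YourDatabase_2.bak',
   DISK = N'F:\Backup\YourDatabase_3.bak',
   DISK = N'F:\Backup\YourDatabase_4.bak'
WITH COMPRESSION,
     CHECKSUM,                    -- validate page checksums while reading
     MAXTRANSFERSIZE = 4194304,   -- 4 MB transfers instead of the 1 MB default
     BUFFERCOUNT = 64,            -- more IO buffers for the backup threads
     STATS = 10;                  -- progress message every 10 percent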

Beyond backups, the other maintenance task DBAs need to perform to ensure the availability of their data is database consistency checking. These checks, often referred to by the name of the command used to run them, DBCC CHECKDB, confirm that all pages contain the data the engine expects them to have. Corruption can happen when storage fails, when memory is corrupted, or, in some cases, because of bugs within SQL Server. Corruption can also be hidden: the backup process does not examine the contents of each page for corruption, and will happily back up corrupt data. The CHECKDB process is very resource intensive, since every page is read, which drives both heavy IO and heavy CPU consumption. You should run CHECKDB at least as frequently as your backup retention period. If your system cannot tolerate the load imposed by CHECKDB, you can restore the database elsewhere and run CHECKDB against the restored copy. While this is not a complete replacement, in most cases it is good enough, as it confirms your backup contains consistent data.
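
As a sketch of that offloading pattern, with hypothetical database, file, and path names (the MOVE clauses must use the logical file names recorded in the backup):

-- Restore the latest backup to a non-production server, then check it.
RESTORE DATABASE [YourDatabase_Check]
FROM DISK = N'F:\Backup\YourDatabase_1.bak',
     DISK = N'F:\Backup\YourDatabase_2.bak'
WITH MOVE N'YourDatabase_data' TO N'D:\Data\YourDatabase_Check.mdf',
     MOVE N'YourDatabase_log' TO N'L:\Log\YourDatabase_Check.ldf',
     STATS = 10;

-- A clean run here confirms the backup holds a consistent database.
DBCC CHECKDB (N'YourDatabase_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;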

Pure Storage FlashArray offers a number of benefits for databases, the first being data reduction on the storage array. While techniques like columnstore or page compression can compress data within the database, storage-level data reduction shrinks the footprint on the array and reduces the amount of IO needed to serve your queries. Additionally, having the option of using snapshots for backups or database clones can greatly reduce the time spent on backup operations. CHECKDB operations likewise benefit from the performance of the array and can take advantage of snapshots, running the check against a snapshot of the database.

Summary

There is no question that relational databases like SQL Server are very IO-intensive applications. If the storage subsystem cannot keep up with the requirements of the databases, there will be bottlenecks and unhappy users. The most important job of storage is delivering that performance, but beyond performance alone, features like snapshots and data reduction deliver additional value and productivity benefits.

The simple configuration of Pure Storage FlashArray avoids the complexity that database deployments required in the past, while the easy-to-use Pure1 platform simplifies monitoring. Not every problem can be fixed by changing code, so implementing a modern, fast storage solution like Pure Storage FlashArray may be the answer you have been searching for.

Next Steps
  • Test drive modern data storage, explore features that optimize database operations, simplify data management and protect your data.
  • Learn more about snapshots and how they can complement your backups in this Snapshots for Dummies Guide.

MSSQLTips.com Product Spotlight sponsored by Pure Storage.

About the author
Joseph D'Antoni is a Senior Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant at Denny Cherry and Associates.

This author pledges the content of this article is based on professional experience and not AI generated.



Article Last Updated: 2022-11-29
