We experience regular slowdowns on our SQL Server databases. After analyzing the memory and CPU usage we would like to continue the root cause investigation by examining I/O bottlenecks. What is your recommendation to recognize I/O related bottlenecks in SQL Server?
The I/O subsystem is a key factor when it comes to SQL Server performance since database pages are constantly moved in and out of the buffer pool. Also the transaction logs and tempDB generate significant I/O traffic. Therefore you have to ensure that your I/O subsystem performs as expected, otherwise you will be a victim of increased response times and frequent time-outs. In this tip I will describe some of the ways to identify I/O related bottlenecks using the built-in tools and provide some disk configuration ideas.
You can use Performance Monitor to check the load on your I/O subsystem. The following performance counters can be setup to check disk performance.
The PhysicalDisk Object: Avg. Disk Queue Length counter shows you the average number of read and write requests that were queued on the selected physical disk. The higher the number the more disk operations are waiting. It requires attention if this value frequently exceeds a value of 2 during peak usage of SQL Server. If you have multiple drives you should take this number and divide by the number of drives in the array to see if the number is above 2. For example, you have 4 drives and a disk queue length of 10, this would be 10/4 = 2.5, so this would be the value you want to use not 10.
Avg. Disk Sec/Read and Avg. Disk Sec/Write shows the average time of a data reads and writes from/to the disk. It is good up to 10 ms, but it is still acceptable if less than 20 ms. Any higher value needs further investigation.
Physical Disk: %Disk Time is the ratio of elapsed time when the disk drive was busy with read or write requests. The rule of thumb for this value is that it should be below 50 percent.
The counters Disk Reads/Sec and Disk Writes/Sec show you the rate of read/write operations on the disk. It should be less than 85 percent of the disk capacity since the disk access time increases exponentially beyond this value.
You can determine the disk capacity by gradually increasing the load on the system. One way to do this is to use SQLIO. You should look for the point where the throughput is constant, but the latency increases.
You can use the counters for RAID configurations with the following calculations: Raid 0: I/O per disk = (reads + writes) / number of disks Raid 1: I/O per disk = [reads + (writes*2)] / 2 Raid 5: I/O per disk = [reads + (writes*4)] / number of disks Raid 10: I/O per disk = [reads + (writes*2)] / number of disks
Here is an example of your I/O per disk for RAID 1, if we get these values from the counters:
Disk Reads/sec = 90 Disk Writes/sec = 75 The formula for I/O on a RAID-1 array is [reads + (writes*2)] / 2 or [90 + (75*2)] / 2 = 120 I/Os per disk
Dynamic Management Views
There are some useful Dynamic Management Views (DMVs) to check I/O bottlenecks.
An I/O latch wait occurs when a page is accessed for reading or writing but the page is not available in the buffer pool. It causes waits on PAGEIOLATCH_EX or PAGEIOLATCH_SH, depending upon the type of request. These wait types can indicate an I/O bottleneck. You can query the sys.dm_os_wait_stats DMV to find latch wait statistics. You can identify I/O problems if you save query outputs of waiting_task_counts and wait_time_ms values from a normal working state of your SQL Server and compare these values when performance is degraded.
where wait_type like 'PAGEIOLATCH%'
order by wait_type asc
Pending I/O requests can be found by querying the following DMVs and can be used to identify which disk is responsible for the bottleneck.
from sys.dm_io_virtual_file_stats(NULL, NULL) iovfs,
sys.dm_io_pending_io_requests as iopior
where iovfs.file_handle = iopior.io_handle
I would recommend that you check the disk fragmentation and the configuration of your disks used by the SQL Server instance.
Fragmentation of files on NTFS can cause significant reductions in performance. Disks should be defragmented regularly and a defragmentation policy and plan should be put in place. Research shows that in some cases a SAN can actually perform worse with defragmentation enabled thus SANs need to be treated on a case-by-case basis.
Fragmentation of indexes can also cause high I/O utilization on NTFS, but this does not have the same affect on SANs which perform better on random I/Os.
Disk Configuration / Best Practices
As a general rule, you should have log files on a physical disk that is separate from the data files for better performance. The I/O profile for a heavily used database data files (including tempDB) are random. The I/O profile for all database log files are sequential except when a transaction needs to be rolled back.
The internal disks should only be used for database log files, because they excel at sequential I/O, but perform poorly on random I/Os.
The database data and log files should each be placed on their own dedicated disk packs. To ensure optimal performance, I recommend that the database log file be placed on two internal disks configured as RAID 1. The database data file should reside on a SAN system which is only accessed by SQL Server and has controlled querying and reporting. Ad hoc access should be disallowed.
Write caching should be enabled where possible and you should make sure the cache is protected from power failures and other possible failures.
To limit possible I/O bottlenecks for your OLTP system you should not mix OLAP and OLTP environments. Also, make sure your code is tuned and create indexes where necessary to avoid unnecessary I/Os.
Collect and compare performance counters
Analyze DMV information
Run SQL Server Profiler to find high Read and Write queries that can be tuned
First of all seems like reading it a bit late :) But still wanted to ask the details of your statement
"Fragmentation of indexes can also cause high I/O utilization on NTFS, but this does not have the same affect on SANs which perform better on random I/Os"
How index fragmentation would not cause big I/Os on SAN while considering there would be much more page scans and use of fragged indexes always yield poor executions? Are you talking about the fragmentation of indexes on multiple DISKs arrays (the storage)? If this is the case then your statement is very confusing and could mislead others. Please make me understand this. Thanks.
Thursday, July 28, 2011 - 11:18:17 AM - Tibor Nagy
Thanks for the response. So based upon on your feedback, can I safely say that one can let the fragged indexes on SAN since it is good at random I/Os for some time (the performance would not detoriate considerably) ? If yes, then when should I start considering index defragmentation on SANs (if should be)? Thanks in advance.