Perfmon Counters for CPU Usage, Memory, Disk and Network Performance

By: Ben Snaidero

Overview

When it comes to deciding what to monitor in regards to general server health I like to break it down into the following areas: Processor Utilization, Disk Activity, Memory Usage and Network Usage. The following set of counters will give you a good indication of any issues that could be affecting any of these areas.

[Processor] "% Processor time"

This is the percentage of total elapsed time that the processor was busy executing. This counter provides a very general measure of how busy the processor is and if this counter is constantly high, say above 90%, then you'll need to use other counters (described below) in order to further determine the root cause of the CPU pressure.

[Processor] "% Privileged time"

This measures the percentage of elapsed time the processor spent executing in kernel mode. Since this counter takes into account only kernel operations (eg. memory management) a high percentage of privileged time, anything consistently above 25%, can usually point to a driver or hardware issue and should be investigated.

[Processor] "% User time"

The percentage of elapsed time the processor spent executing in user mode. To describe this simply, it's really the amount of time the processor spent executing any user application code. Generally, even on a dedicated SQL Server, I like this number to be consistently below 65% as you want to have some buffer for both the kernel operations mentioned above as well as any other bursts of CPU required by other applications. Anything consistently over 65% should be investigated and this might mean digging deeper into exactly what process is using the CPU (if it's not SQL Server) by using the "[Process] % User time" counter for the individual processes.

[Processor] "Queue Length"

This is the number of threads that are ready to execute but waiting for a core to become available. On single core machines a sustained value greater than 2-3 can mean that you have some CPU pressure. Similarly, for a multicore machine divide the queue length by the number of cores and if that is continuously greater than 2-3 there might be CPU pressure.

[Physical Disk] "Current Disk Queue Length" & "Avg. Disk Queue Length"

The "Current Disk Queue Length" a direct measurement of the disk queue length at the time it is sampled so in most cases it is better to monitor "Avg. Disk Queue Length" as this value is derived using the (Disk Transfers/sec)*(Disk sec/Transfer) counters. Using this calculation gives a much better measure of the disk queue over time, smoothing out any quick spikes. Having any physical disk with an average queue length over 2 for prolonged periods of time can be an indication that your disk is a bottleneck.

[Physical Disk] "% Idle Time"

This is a measure of the percentage of time that the disk was idle. ie. there are no pending disk requests from the operating system waiting to be completed. Ideally you would want this value to be as high as possible but even low values are acceptable assuming the queue length counters mentioned above are low.

[Physical Disk] "Avg. Disk sec/Read" & "Avg. Disk sec/Write"

These both measure the latency of your disks, that is, the average time it takes for a disk transfer to complete. Although this value is displayed in seconds it is actually accurate down to milliseconds. Good and bad values for these counters are really dependent on your hardware. For example, you would expect lower values for an SSD as compared to any spinning disk. For counters like this I find it best to get a baseline after the hardware is installed and use this value going forward to determine if you are experiencing any latency issues related to the hardware.

[Physical Disk] "Disk Reads/sec" & "Disk Writes/sec"

These counters each measure the total number of IO requests completed per second. Similar to the latency counters, good and bad values for these counters depend on your disk hardware but values higher than your initial baseline don't normally point to a hardware issue in this case. Rather the high values usually mean that there are queries, new or old, that are missing indexes and are reading more data than required.

[Memory] "Available MBs"

The total amount of available memory on the system. Usually you would like to keep about 10% free but on systems with a really large amounts of memory (>50GB) there can be less available especially if it is a server that is dedicated to just a single SQL Server instance. If this value is lower than this then it's not necessarily an issue provided the system is not doing a lot of paging. If you are paging then further troubleshooting can be done to see which individual processes are using most of the memory.

[Memory] "Pages/sec"

This is actually the sum of "Pages Input/sec" and "Pages Output/sec" counters which is the rate at which pages are being read and written as a result of pages faults. Small spikes with this value do not mean there is an issue but sustained values of greater than 50 can mean that system memory is a bottleneck.

[Memory] "Paging File(_Total)\% Usage"

The percentage of the system page file that is currently in use. This is not directly related to performance, but you can run into serious application issues if the page file does become completely full and additional memory is still being requested by applications.

[Network interface] "Bytes total/sec"

This counter measures the number of bytes being transferred through your network adaptor. You want to make sure that your network adaptor can handle the amount of traffic flowing in and out of your server and ideally you would not want to see this value go over 75% (ie. 75MB/s for a 100MB/s network adaptor) for prolonged periods of time. If you do see values above this you should consider adding another network adaptor or you can use the "Bytes sent/sec" and "Bytes received/sec" counters to determine if it's incoming or outgoing traffic that is causing the bottleneck.