As long as relational databases have been around, query performance has always been one of the top issues that every DBA has to spend some amount of time troubleshooting. As much as we would like to be able to make sure the performance of a query is optimal before it goes into production, there are many factors that come into play that can prevent this from happening. Inadequate test environments, changing data in production (data volume or the data distribution/cardinality), lack of testing and even other software running on the server are just some of the factors that can lead to poor query performance.
No one likes getting that call that the application is down or is very slow responding and they suspect it is the database that is causing the issue. It would be great that if when you do get that call, you could perform a quick check and see if there are any issues on any of your databases that could be causing this poor performance.
Also, while it's never fun dealing with an ongoing performance crisis at least with a real-time performance issue it gives you an easy place to start your troubleshooting. Having someone send an email or catch you in the lunch room and tell you they were running some reports or using an application at some point in the past (days or even weeks back) and the application response time was much slower than normal. Would you be able to look into if it was at all caused by a database system or query issue?
Having a tool that is able to provide a quick diagnosis of these types of issues is something every database professional should have in their toolbox. Could you suggest a solution that could provide insight into the questions posed above?
Having reliable predictable system performance is the goal of every IT professional and this is even truer for a database professional given that the data is the backbone of almost every organization. While you could build your own set of scripts using views, events, auditing, etc. to perform these tasks. Building, and more importantly maintaining, these scripts can require a lot of time, something that most database professionals do not have a lot of to spare. Having a tool that is able to pinpoint any anomalies with respect to performance is something every database professional would welcome. One product that can do this is the SolarWinds Database Performance Analyzer for SQL Server (DPA). This software is capable of monitoring multiple platforms, Oracle, MySQL, DB2, SAP ASE, Cloud and of course SQL Server. In order to demonstrate some of the features on the product let’s walkthrough a few typical issues that pretty much every database professional has seen at one time or another and show how this solution can be used to pinpoint and solve these issues. In each of the following examples we will assume that all we know is someone told us the system was slow at some point in time and they suspect it is/was the database that caused the issue.
- User is currently running a simple report that looks to be hung
- User was running a report yesterday night that normally takes seconds, but was taking minutes
- SAN/server admin says they were seeing IO spikes from the database server that were hitting the maximum number of IOPS that the server can handle
Before we dive into the examples, let me explain at a high level how the software works. Most of the performance data collected by the tool is based on wait statistics. Once per second the tool connects to the database and gathers a small subset of metadata related to each of the active queries. Collecting only a small subset of metadata means there is very little overhead for these once per second calls. At a much lower frequency the detailed wait statistics and other performance data is collected and backfilled to provide a complete picture of what was happening on your database instance. Also important to note is that the tool does all of this without any software having to be installed on your production database server. It instead makes a JDBC connection from a separate server to collect the data and stores the data on a database installed on this server. Performance data is kept for 30 days which gives you the ability to look for any deviations from your baseline performance statistics by comparing what is currently happening in your database with what was going on at any other given time in the last 30 days.
Issue 1 - Blocking Query in SQL Server
A user has informed you that they have a simple report that has been running for over 5 minutes and it usually completes in under 10 seconds. They ask you to take a look and see if you see any issues in the database that could be causing this report to run slow slowly.
Let’s start by opening the web-based UI for the SolarWinds Database Performance Analyzer for SQL Server. Once logged in you are presented with the home page which gives a high level summary of your monitored database instances. In this case we only have one SQL Server instance, but in your case you would start by clicking on the hyperlink (circled in green in the screenshot below) for the instance associated with the application experiencing the issue.
After clicking the hyperlink we need to select the “Current” tab (circled in green) in order to get a look at what is currently happening in our SQL Server instance. This tab gives us an overview of the SQL statements with the highest execution times over the last hour and you can drill down into the details by selecting the time hyperlinks below the graph or the SQL statement id’s to the right of the graph. For this issue though we are more interested in any blocking sessions and as shown below (circled in yellow) we have 2 sessions currently blocked.
Selecting the “Show blocked sessions” hyperlink from the screen above we are presented with a blocking tree shown below.
If we select the “Action” dropdown and then “Recent SQL” a window opens below this tree which shows the query that was called by our report and it’s waiting on spid 55.
If we login to our instance running on the SQL-SERVER server (or use the “Action” -> “Recent SQL”) we would see that session 55 has an open transaction where they have updated the entire Sales.SalesPerson table, but they have not committed their transaction. Here is the last batch executed by SPID 55.
begin transaction; update Sales.SalesPerson set SalesYTD=0;
We can either kill this session from the DPA (by selecting “Action” – “Kill Session”), from within SSMS (using kill 55) or simply contact the user and ask them to commit or rollback their transaction so this report can complete.
Issue 2 – SQL Server under CPU load
A user bumps into you in the morning (July 26th) and mentions quickly in passing that they were working late last night and running some reports that normally complete in a few seconds, but for some reason last night were taking almost 2 minutes. They ask if when you have time later in the day if you could take a look. The only other information they give you is that the reports were run between 3:00 and 3:30am.
This use case is where the SolarWinds Database Performance Analyzer for SQL Server can really give you a lot of historical detail you would not normally be able to query. We start again as we did in the last example by selecting the database instance that the report would have been run against.
This time we will stay on the “Trends” tab (circled in yellow) which gives us a high level summary of the top SQL statements that have been executed over the past 7 days. Since we know this report was run on July 26th we will drill down into this day by using the hyperlink below the graph (circled in green).
Once selected we see a similar chart to the one from the previous screen only now it’s broken down by hour for the day of the 26th. We can already see from the graph that something strange was going on between 2:00am and 4:00am. From here you can drill down even further by selecting one of the hyperlinks below (circled in yellow) or by using the “Time” dropdown (circled in green). Since we know this happened between 3:00am and 3:30am let use the “Time” dropdown and select the “3:00am to 4:00am” time window.
After drilling down even further we can see below that there weren’t many queries run at this time (which you would expect given the time of day) and we can see the high wait times for the report that was trying to be run. But why the high wait times? Let’s take a look at some of the server performance statistics around this time using the “Timeslice” tab (circled in green).
Once selected we are again presented with a summary of the top SQL statements broken down into 10 minute slices over the time frame we selected earlier. Also, below are some charts which plot the status for selected system resources over the same time frame. Here looks to be the root cause of our issue. We can see that although our database instance is not very active at this time the CPU has been pegged at 100% for the entire hour. This report gives you enough information to convince your system administrators that they will need to look into the server to see what was consuming all the CPU resources on the server. In this case it was me running CPUSTRES, oops.
One other thing I’d like to highlight with the “Timeslice” tab is that you can add other resource charts to the UI using the “Add Resource Chart” button (circled in green). Many of these charts can be very helpful in correlating poor query performance to system performance.
Issue 3 – SQL Server performing abnormally large amount of IO
You get an email from your server/SAN administrator and they tell you that one of your database servers has been hitting the maximum IOPS that the server can handle every 30 minutes for the last day or so. He wants you to take a look and see if this is something that can be fixed within the database or if they should consider moving this database instance to more powerful hardware.
As we’ve done for the first couple examples let’s open the web-based UI for the SolarWinds Database Performance Analyzer for SQL Server and drill down into the database instance that is having the IO issues (circled below in green).
Since we know this issue has only been occurring for the last day or so let’s drill down into the current day by selecting July 28th (circled in green).
We know that we are looking for queries that might be IO intensive so let’s select the “Waits” tab (circled in green) and look for any IO related waits (circled in yellow). Based on the bar graph it looks like we might have an issue related to PAGEIOLATCH_* waits.
One thing I would like to highlight here is that you can click on the hyperlink for each wait on the right and it will give you a description of the wait, who typically resolves this type of wait and some possible solutions. You can see in the example below that PAGEIOLATCH waits are used for disk-to-memory transfers, exactly the issue we are trying to resolve.
After drilling down into any of the time hyperlinks from the previous chart and selecting the “SQL” tab on the following window (circled in green) we can easily see which query is experiencing the high amount of PAGEIOLATCH waits. Highlighting the query and seeing the actual SQL being executed it looks like this query is doing a full table scan as there is no WHERE clause in the statement.
Let’s select this statement and get some more details. From the screenshot below we have couple different options that can help us determine if this query is indeed the root cause of the excessive IO. The system/SAN administrator gave us two other pieces of information. The IO spikes have been happening for a day or so and they occur every 30 minutes. The view historical charts (circled in yellow) can give us a good idea of how long this query has been running and the “Timeslice” tab (circled in green) can allow us to correlate the query executions with our system resource usage.
Looking first at the “Timeslice” tab we can see that this query is running every 30 minutes and by including the “Page Reads per Second” resource chart we can see a large spike in this metric when this query I am running.
Switching over to the “View Historical Charts” link we can confirm that this query has only been running for the last day or so as the first execution looks to be on July 27th.
I think it’s safe to assume that this query is our culprit and given that it’s running on a 30 minute interval we should be able track down what is running this fairly easily as it’s definitely something that is scheduled through the SQL Server Agent or our application. Once identified we can have the user who wrote it either turn it off or update the query to include a where clause so we can avoid these large IO spikes on our database instance.
- Learn more about SolarWinds Database Performance Analyzer for SQL Server and get your free evaluation version
- Read other tips on using Database Performance Analyzer
- Read more on the other platforms that the Database Performance Analyzer supports:
- Additional Resources
MSSQLTips.com Product Spotlight sponsored by SolarWinds, makers of Database Performance Analyzer.
Last Update: 2017-09-14
About the author
View all my tips