By: Tim Smith | Updated: 2019-02-08 | Comments | Monitoring
Our support engineers and junior DBAs are currently overwhelmed by a volume of alerts from errors each day. Most of these errors are not something they can take action on immediately and some require further analysis as to whether the error is something we expect, or something that may be a warning sign. How can we organize our alerts to help our support engineers and junior DBAs?
Let's look at a reporting feature over an alerting feature and where we may want to use one over the other in a few key cases:
- Does the event halt customers, clients or colleagues to where nothing can happen on their end? If the answer is yes, we want an alert.
- Does the event need to be solved as soon as possible, as it could become worse (such as database corruption, even if the database can still function)? If the answer is yes and we cannot wait to solve this, we want an alert.
- Will the event happen multiple times a day? We may want an alert if the first item applies, but if this doesn't stop anything for customers, clients or colleagues, we risk interrupting people about an event that may not need immediate attention.
- Related to point three, do we want to know how many times the event happened, such as an automatic application or server restart if CPU or memory crosses a threshold? If so, reporting offers a better option with a daily summary over alerting, as this event may happen multiple times and we want to know those details but without stopping any work.
Creating SQL Server Error Message Table and Loading Data
One design for this where we can consolidate messages from different sources (such as the server error log, custom messages from objects we might be monitoring for changes, custom application logs within our database, etc) is to use a two column table with a message column and a date and time column.
For this tip, we'll be using the below structure for our examples on alerting - the below code also shows insert statements from various sources we'd use (one of which would be a custom application log in SQL Server). How would we know when to add more possible messages to this table? Anything we would want to track and alert over time - such as 30 messages.
CREATE TABLE serverErrorLog ( TextMessage VARCHAR(MAX), LogDate DATETIME ) -- Examples of information we might save ---- Saving details from the server error log: DECLARE @save TABLE(LogDate DATETIME, ProcessInfo VARCHAR(10), [Text] TEXT) INSERT INTO @save EXEC ('EXEC sp_readerrorlog 0,1') INSERT INTO serverErrorLog (TextMessage, LogDate) SELECT [Text], [LogDate] FROM @save -- Note that we may want to use filters for errors we want aggregates on
Here are some other examples of saving custom messages:
---- Saving custom details from object information INSERT INTO serverErrorLog (TextMessage, LogDate) SELECT ('Procedure change: ' + [name]), modify_date FROM sys.procedures WHERE modify_date > DATEADD(MINUTE,-15,GETDATE()) ---- Here is an example of saving details from a custom application log if one exists INSERT INTO serverErrorLog (TextMessage, LogDate) SELECT ErrorMessage, ErrorDate FROM tbETLLoader
Finding SQL Server Errors
We'll start with an example where we want an alert - it involves the first situation from our list - a fatal error in an application that will disrupt clients and colleagues. Just one message involving fatal error counts (as we see a warning telling us there have been 48 fatal errors) would be enough to raise an alert.
DECLARE @count SMALLINT, @message VARCHAR(MAX) SELECT @count = COUNT(TextMessage) FROM serverErrorLog WHERE LogDate > DATEADD(MINUTE,-1,GETDATE()) AND TextMessage LIKE 'Fatal error%' SET @message = 'Fatal errors detected: ' + CAST(@count AS VARCHAR(3)) SELECT @message MessageTest -- use this to send email message EXEC msdb.dbo.sp_send_dbmail @profile_name = 'alt1' , @recipients = '[email protected]' , @subject = @message , @body = @message
Let's evaluate this and why it would qualify as the first item from our list - meaning we'd issue an alert:
- We have a table where we're storing errors we want to use for alerts or reports.
- We've defined a fatal error as an alert (AND TextMessage LIKE 'Fatal error%') and counted how many of these we found. For every alert, we would want a defined rule and in some cases, we may have a 100 rules that could possibly worth raising an alert.
- We're looking every minute for fatal errors; when we raise alerts on issues, we want to find those issues fast. In this case, we see the application found 2 fatal errors in the last minute.
- Our message (and in this case, subject) communicates exactly what we want. If we wanted more detail, such as an alert with the CHECKDB output, we would include that in the body with a subject of Integrity Check Failed or CHECKDB Failed. Remember security here: we only want an alert to communicate what happened in some situations; other situations may allow us more flexibility, but be careful about communicating too much in an alert.
In the below code, query the same table with this query reviewing the error messages and how many times we received the error messages the previous day. We'll notice the fatal errors return as well as other errors which we didn't have alerts because they fit the third and four item from our list. In some cases, we'll want to exclude alerting items on a daily report, but they can add context - for instance, other warnings which we don't alert may be predicting a future error, which we do want an alert. In cases like that, we may want to add an alert if we discover too many warnings.
SELECT TextMessage , COUNT(TextMessage) CountMessages FROM serverErrorLog WHERE LogDate > DATEADD(DD,-1,GETDATE()) GROUP BY TextMessage
One of the messages that we see returned in the above query relates to login attempts. I included this example intentionally because warnings (or errors) like this can be useful to have context. Returning to the example I listed above where someone tells you, "We had 30 SQL Server restarts last week" - how do we know if that's normal or not? We would want to compare it to our historical data. The same is true with events like login failures. We may average one or two login failures daily, but a day where we have 5 or more could be a red flag. Context matters. In these cases, we'll extend our query:
;WITH GetAverage AS( SELECT TextMessage , CAST(LogDate AS DATE) DateMessages , COUNT(TextMessage) CountMessages FROM serverErrorLog WHERE TextMessage LIKE 'Login attempts%' GROUP BY TextMessage, CAST(LogDate AS DATE) ), UseAverage AS( SELECT AVG(CountMessages) AvgMessages , STDEV(CountMessages) StDevMessages FROM GetAverage ) SELECT * , (SELECT AvgMessages FROM UseAverage) AvgMessages , ROUND((SELECT StDevMessages FROM UseAverage),0) StDevMessages FROM GetAverage
In the above query, we group by the date of the login attempts in our first common table expression and get the average and standard deviation in our second common table expression. From here, we can compare the daily logins with the average and deviation. For the sake of this example, we'll look at the query result. We can actually define a rule with a where clause involving our average and deviation, depending on how strict we want our daily logins to conform to what we expect. Would this be appropriate for all messages we may get on a daily or weekly report? No. This type of reporting is appropriate when we want to compare our daily amount with our historic baseline.
Up to this point we've created alerts on errors that are high priority, we've created a daily report of the previous day's errors and warnings, and we've even created a report that allows us to compare our daily result with the history. Since we're using a table structure similar to SQL Server's error log with at least a date and message, we can simplify this by adding a column indicating priority. In the below code, we do this.
ALTER TABLE serverErrorLog ADD MessagePriority TINYINT DEFAULT 0
In this case, we would report our errors with a priority attached, such as 1 indicating the highest priority (alert must be issued) along with other numbers for lower priorities. Using our first query from above this as an example, if we use a priority column, we now filter by this and issue the alert (which means that any message in the last minute that are priority 1 are sent) along with the messages of those priority 1 alerts. As a security reminder, we should be careful about sending the actual alert - so we might want to re-use the subject here to just report the count of priority 1 alerts, if security is a top priority.
DECLARE @count SMALLINT, @subject VARCHAR(1000), @message VARCHAR(MAX) = '<p>Fatal Errors:</p>' SELECT @count = COUNT(TextMessage) FROM serverErrorLog WHERE LogDate > DATEADD(DD,-1,GETDATE()) AND MessagePriority = 1 SELECT @message += '<p>' + TextMessage + '</p>' FROM serverErrorLog WHERE LogDate > DATEADD(DD,-1,GETDATE()) AND MessagePriority = 1 SET @subject = 'Priority 1 alerts in the last minute: ' + CAST(@count AS VARCHAR(3)) SELECT @subject SubjectTest, @message MessageTest -- use this to send email message EXEC msdb.dbo.sp_send_dbmail @profile_name = 'alt1' , @recipients = '[email protected]' , @subject = @subject , @body = @message
This can simplify our querying and organization, but we have to design for it, and we have to demarcate categories for our alerting, warnings and reporting. In smaller environments, this design may be too much whereas this design works well for large environments, as prioritizing which issues should be resolved as soon as possible becomes paramount.
- Any situation that must be solved immediately should be sent as an alert. All other situations that can take time are not alerts and they undermine what an actual alert is by potentially confusing recipients. In some cases, we can use a reporting feature for immediate alerts, if we have an automated process that will attempt to resolve the reported alert and a failure would issue an alert by contact, such as email, call or other format.
- Situations that should be reviewed on a timely basis - but are not situations that must be solved immediately - should be reported in a review format. We can use a review email for the time period, such as the last day, or we can use an SSRS or other reporting format.
- Over time, we can use a priority system for alerts, warnings and other situations. We may redefine rules because we want to avoid warnings or other situations that weren't initially defined as alerts.
Last Updated: 2019-02-08
About the author
View all my tips