Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

SQL Server T-SQL Alerts for Standard and Custom Errors


By:   |   Last Updated: 2019-02-08   |   Comments   |   Related Tips: More > Monitoring

Problem

Our support engineers and junior DBAs are currently overwhelmed by a volume of alerts from errors each day. Most of these errors are not something they can take action on immediately and some require further analysis as to whether the error is something we expect, or something that may be a warning sign. How can we organize our alerts to help our support engineers and junior DBAs?

Solution

Let's look at a reporting feature over an alerting feature and where we may want to use one over the other in a few key cases:

  1. Does the event halt customers, clients or colleagues to where nothing can happen on their end? If the answer is yes, we want an alert.
  2. Does the event need to be solved as soon as possible, as it could become worse (such as database corruption, even if the database can still function)? If the answer is yes and we cannot wait to solve this, we want an alert.
  3. Will the event happen multiple times a day? We may want an alert if the first item applies, but if this doesn't stop anything for customers, clients or colleagues, we risk interrupting people about an event that may not need immediate attention.
  4. Related to point three, do we want to know how many times the event happened, such as an automatic application or server restart if CPU or memory crosses a threshold? If so, reporting offers a better option with a daily summary over alerting, as this event may happen multiple times and we want to know those details but without stopping any work.
The final point is key especially if we're using the report to gather a baseline about what is normal with our application and what we can expect on a daily, weekly or monthly basis. What we'll look at in this tip is creating a useful summary that we can send out on a scheduled basis, such as daily or weekly, instead of sending 30 messages a day that people learn to ignore. This becomes especially useful when we want to integrate our alerts and summaries with Support, SRE and other teams that will be our first line of assessment who can then filter meaningful information to others. For example, consider the phrase "We had 30 SQL Server restarts last week" in two situations where that is normal and abnormal, but without any context or filtering: not everyone will know if that's normal or not and not everyone should be informed without context.

Creating SQL Server Error Message Table and Loading Data

One design for this where we can consolidate messages from different sources (such as the server error log, custom messages from objects we might be monitoring for changes, custom application logs within our database, etc) is to use a two column table with a message column and a date and time column.

For this tip, we'll be using the below structure for our examples on alerting - the below code also shows insert statements from various sources we'd use (one of which would be a custom application log in SQL Server). How would we know when to add more possible messages to this table? Anything we would want to track and alert over time - such as 30 messages.

CREATE TABLE serverErrorLog (
	TextMessage VARCHAR(MAX),
	LogDate DATETIME
)

-- Examples of information we might save

---- Saving details from the server error log:
DECLARE @save TABLE(LogDate DATETIME, ProcessInfo VARCHAR(10), [Text] TEXT)

INSERT INTO @save
EXEC ('EXEC sp_readerrorlog 0,1')

INSERT INTO serverErrorLog (TextMessage, LogDate) 
SELECT [Text], [LogDate] FROM @save -- Note that we may want to use filters for errors we want aggregates on

Here are some other examples of saving custom messages:

---- Saving custom details from object information
INSERT INTO serverErrorLog (TextMessage, LogDate) 
SELECT ('Procedure change: ' + [name]), modify_date FROM sys.procedures WHERE modify_date > DATEADD(MINUTE,-15,GETDATE())

---- Here is an example of saving details from a custom application log if one exists
INSERT INTO serverErrorLog (TextMessage, LogDate) 
SELECT ErrorMessage, ErrorDate FROM tbETLLoader

Finding SQL Server Errors

We'll start with an example where we want an alert - it involves the first situation from our list - a fatal error in an application that will disrupt clients and colleagues. Just one message involving fatal error counts (as we see a warning telling us there have been 48 fatal errors) would be enough to raise an alert.

DECLARE @count SMALLINT, @message VARCHAR(MAX)

SELECT @count = COUNT(TextMessage)
FROM serverErrorLog
WHERE LogDate > DATEADD(MINUTE,-1,GETDATE()) AND TextMessage LIKE 'Fatal error%'

SET @message = 'Fatal errors detected: ' + CAST(@count AS VARCHAR(3))

SELECT @message MessageTest

-- use this to send email message
EXEC msdb.dbo.sp_send_dbmail  
    @profile_name = 'alt1'
    , @recipients = '[email protected]'
    , @subject = @message
    , @body = @message
Image 1

Let's evaluate this and why it would qualify as the first item from our list - meaning we'd issue an alert:

  • We have a table where we're storing errors we want to use for alerts or reports.
  • We've defined a fatal error as an alert (AND TextMessage LIKE 'Fatal error%') and counted how many of these we found. For every alert, we would want a defined rule and in some cases, we may have a 100 rules that could possibly worth raising an alert.
  • We're looking every minute for fatal errors; when we raise alerts on issues, we want to find those issues fast. In this case, we see the application found 2 fatal errors in the last minute.
  • Our message (and in this case, subject) communicates exactly what we want. If we wanted more detail, such as an alert with the CHECKDB output, we would include that in the body with a subject of Integrity Check Failed or CHECKDB Failed. Remember security here: we only want an alert to communicate what happened in some situations; other situations may allow us more flexibility, but be careful about communicating too much in an alert.
In order to follow these rules about alerting, we have to step through this - what qualifies as an alert, how are we going to capture these possible alerts, and how often are we going to check for the alerts. We also see that we'll be emailing an alert email ([email protected]) for this issue; in some situations we may be alerting support, SREs, or others. This structure works with the first and second item where we want to know as soon as possible to resolve. What about situations where the third and fourth point apply?

In the below code, query the same table with this query reviewing the error messages and how many times we received the error messages the previous day. We'll notice the fatal errors return as well as other errors which we didn't have alerts because they fit the third and four item from our list. In some cases, we'll want to exclude alerting items on a daily report, but they can add context - for instance, other warnings which we don't alert may be predicting a future error, which we do want an alert. In cases like that, we may want to add an alert if we discover too many warnings.

SELECT 
    TextMessage
  , COUNT(TextMessage) CountMessages
FROM serverErrorLog
WHERE LogDate > DATEADD(DD,-1,GETDATE())
GROUP BY TextMessage
Image 2

One of the messages that we see returned in the above query relates to login attempts. I included this example intentionally because warnings (or errors) like this can be useful to have context. Returning to the example I listed above where someone tells you, "We had 30 SQL Server restarts last week" - how do we know if that's normal or not? We would want to compare it to our historical data. The same is true with events like login failures. We may average one or two login failures daily, but a day where we have 5 or more could be a red flag. Context matters. In these cases, we'll extend our query:

;WITH GetAverage AS(
   SELECT 
        TextMessage
      , CAST(LogDate AS DATE) DateMessages
      , COUNT(TextMessage) CountMessages
   FROM serverErrorLog
   WHERE TextMessage LIKE 'Login attempts%'
   GROUP BY TextMessage, CAST(LogDate AS DATE)
), UseAverage AS(
   SELECT 
        AVG(CountMessages) AvgMessages
      , STDEV(CountMessages) StDevMessages
   FROM GetAverage
)
SELECT *
   , (SELECT AvgMessages FROM UseAverage) AvgMessages
   , ROUND((SELECT StDevMessages FROM UseAverage),0) StDevMessages 
FROM GetAverage
Image 3

In the above query, we group by the date of the login attempts in our first common table expression and get the average and standard deviation in our second common table expression. From here, we can compare the daily logins with the average and deviation. For the sake of this example, we'll look at the query result. We can actually define a rule with a where clause involving our average and deviation, depending on how strict we want our daily logins to conform to what we expect. Would this be appropriate for all messages we may get on a daily or weekly report? No. This type of reporting is appropriate when we want to compare our daily amount with our historic baseline.

Up to this point we've created alerts on errors that are high priority, we've created a daily report of the previous day's errors and warnings, and we've even created a report that allows us to compare our daily result with the history. Since we're using a table structure similar to SQL Server's error log with at least a date and message, we can simplify this by adding a column indicating priority. In the below code, we do this.

ALTER TABLE serverErrorLog ADD MessagePriority TINYINT DEFAULT 0

In this case, we would report our errors with a priority attached, such as 1 indicating the highest priority (alert must be issued) along with other numbers for lower priorities. Using our first query from above this as an example, if we use a priority column, we now filter by this and issue the alert (which means that any message in the last minute that are priority 1 are sent) along with the messages of those priority 1 alerts. As a security reminder, we should be careful about sending the actual alert - so we might want to re-use the subject here to just report the count of priority 1 alerts, if security is a top priority.

DECLARE @count SMALLINT, @subject VARCHAR(1000), @message VARCHAR(MAX) = '<p>Fatal Errors:</p>' 

SELECT @count = COUNT(TextMessage) 
FROM serverErrorLog 
WHERE LogDate > DATEADD(DD,-1,GETDATE()) AND MessagePriority = 1 
	
SELECT @message += '<p>' + TextMessage + '</p>' 
FROM serverErrorLog 
WHERE LogDate > DATEADD(DD,-1,GETDATE()) AND MessagePriority = 1 
   
SET @subject = 'Priority 1 alerts in the last minute: ' + CAST(@count AS VARCHAR(3)) 

SELECT @subject SubjectTest, @message MessageTest 

-- use this to send email message
EXEC msdb.dbo.sp_send_dbmail 
     @profile_name = 'alt1' 
   , @recipients = '[email protected]' 
   , @subject = @subject 
   , @body = @message	
Image 3

This can simplify our querying and organization, but we have to design for it, and we have to demarcate categories for our alerting, warnings and reporting. In smaller environments, this design may be too much whereas this design works well for large environments, as prioritizing which issues should be resolved as soon as possible becomes paramount.

Next Steps
  • Any situation that must be solved immediately should be sent as an alert. All other situations that can take time are not alerts and they undermine what an actual alert is by potentially confusing recipients. In some cases, we can use a reporting feature for immediate alerts, if we have an automated process that will attempt to resolve the reported alert and a failure would issue an alert by contact, such as email, call or other format.
  • Situations that should be reviewed on a timely basis - but are not situations that must be solved immediately - should be reported in a review format. We can use a review email for the time period, such as the last day, or we can use an SSRS or other reporting format.
  • Over time, we can use a priority system for alerts, warnings and other situations. We may redefine rules because we want to avoid warnings or other situations that weren't initially defined as alerts.


Last Updated: 2019-02-08


next webcast button


next tip button



About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

View all my tips
Related Resources




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Learn more about SQL Server tools