How to Find Long Running Jobs in Microsoft SQL Server

By: Thomas LaRock | Comments (5) | Related: > SQL Server Agent

Problem

You want to know if a job inside of SQL Server Agent is running longer than it should. This leads to a second problem: how do you define “long”? Well, for me, I use math.

Solution

SQL Server persists data inside the msdb database for job history between restarts. Therefore, we can do analysis on the job history data to find if there are jobs running long. Here, we will define “running long” as jobs that are currently running longer than two standard deviations away from the mean (i.e., an outlier).

Variables

There are four variables in the script.

@MinHistExecutions - Minimum number of job executions we want to consider. I like to set this to five or seven, giving me a small sample size to serve as a baseline to get started.

@MinAvgSecsDuration - Threshold for minimum job duration. If set to 300, then we are not concerned with any jobs that have a historical average less than five minutes in duration. This also has implications for how often I will want to poll for long running jobs. If we poll more than 300 seconds apart (say, every 15 minutes), then we may miss an outlier because the code will filter for jobs that are currently running.

@HistoryStartDate - Start date for historical average you want to evaluate.

@HistoryEndDate - End date for historical average you want to evaluate. The default values for start and end dates will capture all job history, but you may want to consider using a smaller window, say 90 days.

Comments

One result set containing a list of jobs that are currently running and the duration is more than two standard deviations away from the historical average. The “Min Threshold” column represents the average plus two standard deviations (in seconds).

/*=============================================
  Variables:
    @MinHistExecutions - Minimum number of job executions we want to consider 
    @MinAvgSecsDuration - Threshold for minimum job duration we care to monitor
    @HistoryStartDate - Start date for historical average
    @HistoryEndDate - End date for historical average
 
  These variables allow for us to control a couple of factors. First
  we can focus on jobs that are running long enough on average for
  us to be concerned with (say, 30 seconds or more). Second, we can
  avoid being alerted by jobs that have run so few times that the
  average and standard deviations are not quite stable yet. This script
  leaves these variables at 1.0, but I would advise you alter them
  upwards after testing.
 
  Returns: One result set containing a list of jobs that
  are currently running and are running longer than two standard deviations 
  away from their historical average. The "Min Threshold" column
  represents the average plus two standard deviations. 

  note [1] - comment this line and note [2] line if you want to report on all history for jobs
  note [2] - comment just this line is you want to report on running and non-running jobs
 =============================================*/
 
DECLARE  @HistoryStartDate datetime 
  ,@HistoryEndDate datetime  
  ,@MinHistExecutions int   
  ,@MinAvgSecsDuration int  
 
SET @HistoryStartDate = '19000101'
SET @HistoryEndDate = GETDATE()
SET @MinHistExecutions = 1.0
SET @MinAvgSecsDuration = 1.0
 
DECLARE @currently_running_jobs TABLE (
    job_id UNIQUEIDENTIFIER NOT NULL
    ,last_run_date INT NOT NULL
    ,last_run_time INT NOT NULL
    ,next_run_date INT NOT NULL
    ,next_run_time INT NOT NULL
    ,next_run_schedule_id INT NOT NULL
    ,requested_to_run INT NOT NULL
    ,request_source INT NOT NULL
    ,request_source_id SYSNAME NULL
    ,running INT NOT NULL
    ,current_step INT NOT NULL
    ,current_retry_attempt INT NOT NULL
    ,job_state INT NOT NULL
    ) 
 
--capture details on jobs
INSERT INTO @currently_running_jobs
EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,''
 
;WITH JobHistData AS
(
  SELECT job_id
 ,date_executed=msdb.dbo.agent_datetime(run_date, run_time)
 ,secs_duration=run_duration/10000*3600
                      +run_duration%10000/100*60
                      +run_duration%100
  FROM msdb.dbo.sysjobhistory
  WHERE step_id = 0   --Job Outcome
  AND run_status = 1  --Succeeded
)
,JobHistStats AS
(
  SELECT job_id
        ,AvgDuration = AVG(secs_duration*1.)
        ,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration)
  FROM JobHistData
  WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101')
  AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101') GROUP BY job_id HAVING COUNT(*) >= @MinHistExecutions
  AND AVG(secs_duration*1.) >= @MinAvgSecsDuration
)
SELECT jd.job_id
      ,j.name AS [JobName]
      ,MAX(act.start_execution_date) AS [ExecutionDate]
      ,AvgDuration AS [Historical Avg Duration (secs)]
      ,AvgPlus2StDev AS [Min Threshhold (secs)]
FROM JobHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id --see note [1] above
JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id
AND act.stop_execution_date IS NULL
AND act.start_execution_date IS NOT NULL
WHERE DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev
AND crj.job_state = 1 --see note [2] above
GROUP BY jd.job_id, j.name, AvgDuration, AvgPlus2StDev

Summary

Traditional alerts that focus on job duration are not considering the historical average. As a result, operators are notified a job is running long based upon a fixed amount of time, say 300 seconds. A better method is to do some analysis on the data already stored inside the msdb. This way if an alert is sent you know that action is needed.

Next Steps

About the author

Thomas LaRock is a Head Geek at SolarWinds and a Microsoft Certified Master, Microsoft Data Platform MVP, VMware vExpert, and a former Microsoft Certified Trainer with over 20 years’ experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Comments For This Article

Thursday, August 20, 2020 - 4:28:44 PM - Sebastien Cornu

Thanks for this very helpful article.

I made few changes in your code. It was not working properly for me because my jobs have several steps, so the job duration was not well calculated. And I think it was important to add a column with the real duration.

I simplified and customized it on the display.

Here's the code:
________________________________________________________________
/*=============================================
Job longer than usual

Parameters:
- Avg duration > 1 min
- Within 7 days

Returns: One result set containing a list of jobs that
are currently running or are running longer than two standard deviations
away from their historical average. The "Duration Threshold" column
represents the average plus two standard deviations.
=============================================*/

DECLARE @MinAvgSecsDuration int
SET @MinAvgSecsDuration = 60

;WITH JobHistData AS
(
SELECT job_id
,date_executed=convert(int, convert(varchar(10), run_date, 112))
,secs_duration=sum(run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100)
FROM msdb.dbo.sysjobhistory
WHERE step_id = 0 -- Job Outcome
AND run_status > 0 -- 1 Succeeded
AND run_status < 3 -- 2 Retry
GROUP BY job_id, run_date
)

,JobHistStats AS
(
SELECT job_id
,AvgDuration = AVG(secs_duration)
,AvgPlus2StDev = AVG(secs_duration) + 2*stdevp(secs_duration)
FROM JobHistData
GROUP BY job_id HAVING AVG(secs_duration) >= @MinAvgSecsDuration
)
SELECT -- jd.job_id,
j.name AS [JobName]
,CONVERT(date, CONVERT(varchar(8), date_executed), 112) AS [ExecutionDate]
,CONVERT(TIME(0), DATEADD(SS,AvgDuration,0),108) AS [Historical Avg Duration]
,CONVERT(TIME(0), DATEADD(SS,AvgPlus2StDev,0),108) AS [Duration Threshhold]
,CONVERT(TIME(0), DATEADD(SS,jd.secs_duration,0),108) AS [Duration]
FROM JobHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
WHERE jd.secs_duration > AvgPlus2StDev
AND date_executed > convert(int, convert(varchar(10), DATEADD(day, -7, getdate()), 112)) -- jobs longer than usual within 7 days
GROUP BY j.name, AvgDuration, AvgPlus2StDev, jd.date_executed, jd.secs_duration
order by j.name, jd.date_executed desc
_____________________________________________

Friday, July 6, 2018 - 1:51:18 PM - Kristopher Hokanson

Been using very similar logic for a few years now, love that you posted this great solution for everyone to use.

I wonder if it might be better to put the filter for the job duration in the final select statement instead of filtering on the history. This way you can catch the job that has historically been running in 10 seconds but is suddenly taking 5 minutes.

Otherwise great stuff! Thank you for putting it together.

Tuesday, July 4, 2017 - 3:24:33 AM - yogender rana

Great artical, thanks Thomas :)

Tuesday, June 6, 2017 - 10:54:56 AM - Charles

i ran this on 2012, even when i comment out 1 and 2, did not get any result.

Thursday, June 1, 2017 - 4:57:06 PM - AZJim

I have found the following code helpful. I think to do what you are doing.

select

r.start_time AS 'CmdStartTime'

, r.status AS 'CmdStatus'

, r.command AS 'ShortCommand'

, DB_NAME(r.database_id) AS 'DataBase'

, r.wait_type AS 'CmdWaitType'

, r.blocking_session_id AS 'CmdBlockingSessionID'

, r.percent_complete AS 'CmdPctComplete'

, CASE WHEN ((r.estimated_completion_time/1000)/3600) < 10 THEN '0' +

CONVERT(VARCHAR(10),(r.estimated_completion_time/1000)/3600)

ELSE CONVERT(VARCHAR(10),(r.estimated_completion_time/1000)/3600)

END + ':' +

CASE WHEN ((r.estimated_completion_time/1000)%3600/60) < 10 THEN '0' +

CONVERT(VARCHAR(10),(r.estimated_completion_time/1000)%3600/60)

ELSE CONVERT(VARCHAR(10),(r.estimated_completion_time/1000)%3600/60)

END + ':' +

CASE WHEN ((r.estimated_completion_time/1000)%60) < 10 THEN '0' +

CONVERT(VARCHAR(10),(r.estimated_completion_time/1000)%60)

ELSE CONVERT(VARCHAR(10),(r.estimated_completion_time/1000)%60)

END AS 'TimeRemainingForCmd'

, CAST(((r.total_elapsed_time/1000.0)/60.0) AS dec(5,1)) AS 'CmdElapsedMinutes'

, sqltxt.text AS 'CmdFullSQLCommandText'

FROM msdb.sys.dm_exec_requests r

OUTER APPLY msdb.sys.dm_exec_sql_text(r.sql_handle) sqltxt

WHERE command LIKE 'BACKUP%' -- Remember, some servers are case sensitive

OR command LIKE 'RESTORE%'

OR command LIKE 'DBCC%'

OR command LIKE 'KILL%'

OR command LIKE 'UPDATE STAT%'

OR command like 'ALTER%' ;