Problem
In a previous post, we examined how to find if a job inside of SQL Server Agent is running longer than it should. In this post, we take a similar approach but shift our focus to individual job steps.
Solution
SQL Server persists data inside the msdb database for job history between restarts. The job history also includes data on the individual job steps. Therefore, we can do analysis on the job step history data to find if there are job steps running long. Here, we will define “running long” as job steps that are currently running longer than two standard deviations away from the mean (i.e., an outlier).
Variables
There are four variables in the script.
- @MinHistExecutions – Minimum number of job step executions we want to consider. I like to set this to five or seven, giving me a small sample size to serve as a baseline to get started.
- @MinAvgSecsDuration – Threshold for minimum job step duration, in seconds. If set to 300, then we are not concerned with any jobs that have a historical average less than five minutes in duration. This also has implications for how often I will want to poll for long running job steps. If we poll more than 300 seconds apart (say, every 15 minutes), then we may miss an outlier because the code will filter for jobs that are currently running.
- @HistoryStartDate – Start date for historical average you want to evaluate.
- @HistoryEndDate – End date for historical average you want to evaluate. The default values for start and end dates will capture all job history, but you may want to consider using a smaller window, say 90 days.
Comments
One result set containing a list of job steps that are currently running and the duration is more than two standard deviations away from the historical average. The “Min Threshold” column represents the average plus two standard deviations (in seconds).
Code Example
/*=============================================
Variables:
@MinHistExecutions - Minimum number of job step executions we want to consider
@MinAvgSecsDuration - Threshold for minimum job step duration we care to monitor
@HistoryStartDate - Start date for historical average
@HistoryEndDate - End date for historical average
These variables allow for us to control a couple of factors. First
we can focus on job steps that are running long enough on average for
us to be concerned with (say, 30 seconds or more). Second, we can
avoid being alerted by job steps that have run so few times that the
average and standard deviations are not quite stable yet. This script
leaves these variables at 1.0, but I would advise you alter them
upwards after testing.
Returns: One result set containing a list of job steps that
are currently running and are running longer than two standard deviations
away from their historical average. The "Min Threshold" column
represents the average plus two standard deviations.
note [1] - comment this line and note [2] line if you want to report on all history for job steps
note [2] - comment just this line is you want to report on running and non-running job steps
=============================================*/
DECLARE @HistoryStartDate datetime
,@HistoryEndDate datetime
,@MinHistExecutions int
,@MinAvgSecsDuration int
SET @HistoryStartDate = '19000101'
SET @HistoryEndDate = GETDATE()
SET @MinHistExecutions = 1.0
SET @MinAvgSecsDuration = 1.0
DECLARE @currently_running_jobs TABLE (
job_id UNIQUEIDENTIFIER NOT NULL
,last_run_date INT NOT NULL
,last_run_time INT NOT NULL
,next_run_date INT NOT NULL
,next_run_time INT NOT NULL
,next_run_schedule_id INT NOT NULL
,requested_to_run INT NOT NULL
,request_source INT NOT NULL
,request_source_id SYSNAME NULL
,running INT NOT NULL
,current_step INT NOT NULL
,current_retry_attempt INT NOT NULL
,job_state INT NOT NULL
)
--capture details on jobs
INSERT INTO @currently_running_jobs
EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,''
;WITH JobStepsHistData AS
(
SELECT job_id, step_id
,date_executed=msdb.dbo.agent_datetime(run_date, run_time)
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
FROM msdb.dbo.sysjobhistory
WHERE run_status = 1 -- Succeeded
)
,JobHistStats AS
(
SELECT job_id, step_id
,AvgDuration = AVG(secs_duration*1.)
,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration)
FROM JobStepsHistData
WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101')
AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101')
GROUP BY job_id, step_id
HAVING COUNT(*) >= @MinHistExecutions
AND AVG(secs_duration*1.) >= @MinAvgSecsDuration
)
-- need to select from the CTE's, and join to msdb for final result
SELECT jd.job_id
,j.name AS [JobName]
,sjs.step_id
,sjs.step_name
,MAX(act.start_execution_date) AS [ExecutionDate]
,AvgDuration AS [Historical Avg Duration (secs)]
,AvgPlus2StDev AS [Min Threshhold (secs)]
FROM JobStepsHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id AND jd.step_id = jhs.step_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
JOIN msdb..sysjobsteps sjs on jd.job_id = sjs.job_id AND jd.step_id = sjs.step_id
JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id --see note [1] above
JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id
AND act.stop_execution_date IS NULL
AND act.start_execution_date IS NOT NULL
WHERE DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev
AND crj.job_state = 1 --see note [2] above
GROUP BY jd.job_id, j.name, sjs.step_id, sjs.step_name, AvgDuration, AvgPlus2StDev
Summary
Traditional alerts that focus on job duration are not considering the historical average. As a result, operators are notified that a job is running long based upon a fixed amount of time, say 300 seconds. A better method is to do some analysis on the data already stored inside the msdb. By examining the duration of specific job steps you can pinpoint where the bottleneck is happening. This way if an alert is sent you know that action is needed.
Next Steps
- Copy the code and run it against your msdb database.
- Consider running the code on a schedule and alert if any rows are returned.
- Alternatively, you can run report daily to examine job steps that are running long.
- Querying SQL Server Agent Job Information
- Custom job categories to organize your SQL Agent jobs
- All SQL Agent Tips

Thomas LaRock is a Head Geek at SolarWinds and a Microsoft Certified Master, Microsoft Data Platform MVP, VMware vExpert, and former Microsoft Certified Trainer. He has over 20 years’ experience in the IT industry in roles including programmer, developer, analyst, and database administrator.
LaRock has spent much of his career focused on data and database administration, which led to his being chosen as a Technical Evangelist for Confio Software in 2010. While at Confio, his research and experience helped to create the initial versions of the software now known as SolarWinds Database Performance Analyzer. LaRock joined the SolarWinds family through the acquisition of Confio in 2013.
LaRock is also the Immediate Past President of the Professional Association for SQL Server (PASS) and is an avid blogger, author, and technical reviewer for numerous books about SQL Server management. He now focuses his time working with customers to help resolve problems and answer questions regarding database performance tuning and virtualization for SQL Server, Oracle, MySQL, SAP, and DB2, making it his mission to give IT and data professionals longer weekends.


