SQL Server Agent Job Advanced Management

By:   |   Comments   |   Related: > SQL Server Agent


When we set up a SQL Server Agent Job, from time to time there may be some special requirements on how jobs should behave.  How can we support these special instances in SQL Server?

For example:

  1. Auto Delete - We wish to delete the job automatically after a specific time, i.e. one week later.
  2. Failure Notifications - If the job continuously fails and if it is configured to send failure notification, we want the notifications to stop after [N] times.
  3. Auto Disable - If the job continuously fails for [N] times, we wish the job to be auto-disabled (and send an email to notify the DBA of this fact).
  4. Scheduling - If the job has a complex schedule, we want to avoid setting up multiple schedules, such as on Mon/Tue/Wed, the job will run hourly between 8:00 AM to 5:00 PM and 8:00 PM to 2:00 AM, but on Thu/Fri, the job will run hourly between 9:00 AM to 2:00 PM and 7:00 PM to 12:00 AM, while on Sat/Sun, it will run hourly between 10:00 AM to 3:00 PM and 10:00 PM to 3:00 AM.
  5. Delayed Execution - If a job continuously fails for [N] times, we want the job to not run in the next [X] hours before running again.

With the help of SQL Server Agent tokens, we can come up with corresponding solutions to each requirement.  The generic approach is to add one header step as the first job step that processes the various requirements then makes the necessary reconfiguration to the SQL Server Agent Job dynamically. Now will take a look at each of the above-mentioned requirements.

Auto Delete SQL Server Agent Jobs

The first requirement is to delete the job after a specific date.  As you may know, in a SQL Server Agent Job there is an option to delete the job as shown below:

SQL Server Agent Job Self Deletion Configuation

So to address the job deletion requirement, all we need to do is add a header job step and put the following code in the step.

The logic is very straight forward. We will check whether the current time is past a specified time, if so, we will set the self deletion configuration and then raise an error to fail the step, which has the default [On failure action] setting (i.e. Quit the job reporting failure) as shown below:

SQL Server Agent Job Step Configuration

declare @job_id uniqueidentifier;
set @job_id=CONVERT(uniqueidentifier, $(ESCAPE_NONE(JOBID))); -- token is used

if getdate() > '2016-07-15 17:00:00' -- change to your preferred end time
    EXEC msdb.dbo.sp_update_job @[email protected]_id, @delete_level=2; -- set auto deletion
    raiserror ('self deletion', 16, 1); -- cause the step failure

Failure Notifications and Auto Disable SQL Server Agent Jobs

The second and third requirements are basically the same from a coding perspective. The basic design is to check the msdb.dbo.sysjobhistory table and if we find [N] continuous failures, we will take the following actions:

  1. Disable the notification (or disable the job).
  2. Send out an email to notify the DBA team about the disable action.
  3. If a next run of the job succeeds, we should reset the notification; i.e. enabling the notification until there are another N continuous failures.

Of course, if you disable the job, the job will never run until enabled outside of the job (because a disabled job cannot enable itself).

set nocount on;
declare @Failure_Number int =3; -- set your preferred N times before you want to take actions
declare @cnt int;
declare @subject varchar(200), @body varchar(512), @job_name varchar(128);
declare @myemail varchar(300) = '[email protected]'; -- replace with your own email account

declare @JOB_ID uniqueidentifier;

set @job_id = convert(uniqueidentifier, $(ESCAPE_NONE(JOBID)));

select @job_name = name from msdb.dbo.sysjobs where job_id = @job_id;

; with c as (
select top (@Failure_Number) instance_id, run_status
from msdb.dbo.sysjobhistory
where job_id [email protected]_id
and step_id=0
order by instance_id desc
select @cnt = count(*) 
from c
where run_status = 0

if (@[email protected]_Number)
    -- exec msdb.dbo.sp_update_job @job_id = @job_id, @enabled=0; -- un-comment this line if you want to disable the job
    if (select notify_email_operator_id from msdb.dbo.sysjobs where [email protected]_id) > 0
  EXEC msdb.dbo.sp_update_job @[email protected]_id, @notify_email_operator_name=N'';
  set @subject = 'Job [' + @job_name +']  failure notification is disabled';
  set @body = 'this job''s failure notification is curently disabled after ' + cast(@failure_number as varchar(10)) + ' continuous failures;'
  exec msdb.dbo.sp_send_dbmail @[email protected], @[email protected], @body = @body; 

else -- if there is a success after [@Failure_Number] failures, we will reenable the notification
    if (select notify_email_operator_id from msdb.dbo.sysjobs where [email protected]_id) = 0
  EXEC msdb.dbo.sp_update_job @[email protected]_id, @notify_level_email=2, @notify_email_operator_name=N'DBA_Test'; -- replace with your own OPERATOR name here

SQL Server Agent Scheduling

For the fourth requirement, a possible solution is to have an if condition to see whether the current time is within the specified time window, if so, exit the current step and continue to the next step, otherwise, raise an error to quit the job.

To prevent the job logging the failure status on step 1 (as we purposely fail this step just for quitting the job), we can configure the step as [On failure action]=[Quit the job reporting success] as shown below:

SQL Server Agent Job Step Configuration to Quit the Job Reporting Success

-- I am hard-coding the date time here for simplicity, but a better way is to create a table to hold all these values for flexibility

declare @dt datetime = getdate();
declare @time varchar(10) = convert(varchar(10), @dt, 108);
if ( (( @time between '08:00:00' and '17:00:00') or
     ( @time between '20:00:00' and '23:59:59')  or
     ( @time between '00:00:00' and '02:00:00')) and
  ( datename(dw, @dt) in ('Monday','Tuesday', 'Wednesday'))
   ) -- Mon/Tue/Wed, between 8:00am to 5:00pm, 8:00pm to 2:00am
   ( (( @time between '09:00:00' and '14:00:00') or
     ( @time between '19:00:00' and '23:59:59')) and
  ( datename(dw, @dt) in ('Thursday','Friday'))
   ) -- Thur/Fri, between 9:00am to 2:00pm, 7:00pm to 12:00am
   ( (( @time between '10:00:00' and '15:00:00') or
     ( @time between '22:00:00' and '23:59:59')  or
     ( @time between '00:00:00' and '03:00:00')) and
  ( datename(dw, @dt) in ('Thursday','Friday'))
   ) -- Sat/Sun, between 10:00am to 3:00pm, 10:00pm to 3:00am
else -- out of allowed windows
 raiserror ('Not in the allowed window', 15, 1);

If we prefer not hard-coding values, we can use the following sample code by putting values into a table:

use msdb; -- you can choose your own user database

create table msdb.dbo.Schedule (StartTime char(8), Endtime char(8), [WeekDay] varchar(12)) -- populate it with needed values

insert into msdb.dbo.Schedule (StartTime, EndTime, [WeekDay]) values
  ('08:00:00', '17:00:00', 'Monday'), ('20:00:00', '23:59:59', 'Monday'), ('00:00:00', '02:00:00', 'Monday')
, ('08:00:00', '17:00:00', 'Tuesday'), ('20:00:00', '23:59:59', 'Tuesday'), ('00:00:00', '02:00:00', 'Tuesday')
, ('08:00:00', '17:00:00', 'Wednesday'), ('20:00:00', '23:59:59', 'Wednesday'), ('00:00:00', '02:00:00', 'Wednesday')
, ('09:00:00', '14:00:00', 'Thursday'), ('19:00:00', '23:59:59', 'Thursday')
, ('09:00:00', '14:00:00', 'Friday'), ('19:00:00', '23:59:59', 'Friday')
, ('10:00:00', '15:00:00', 'Saturday'), ('22:00:00', '23:59:59', 'Saturday'), ('00:00:00', '03:00:00', 'Saturday')
, ('10:00:00', '15:00:00', 'Sunday'), ('22:00:00', '23:59:59', 'Saturday'), ('00:00:00', '03:00:00', 'Sunday')
declare @dt datetime = getdate();
declare @time varchar(10) = convert(varchar(10), @dt, 108);
if exists ( select * from msdb.dbo.Schedule where (@time between StartTime and EndTime) and datename(dw, @dt) = [WeekDay] )
   return -- move on to next job step
   raiserror ('Not in the allowed window', 15, 1);

SQL Server Agent Job Delayed Execution

The fifth requirement is relatively more complex in coding. What we do here is to update the job's associated schedules, so the next start time is [X] time later.

To simplify the work, we assume the job has only one schedule and it is running daily every [x] intervals, where "interval" can be hour/minute/second.

To achieve this work, we will use an external table to avoid hard-coding.

The basic business requirement is:

  • If a job fails continuously [X] times, then we want the job to not run for the next [Y] interval, where interval can be hour/minute/second.
  • After [Y] interval, we will reset the conditions, and from now on if the job fails continuously [X] times, another [Y] interval will be implemented, and so on so forth.

The requirement is challenging to implement with our current architecture, i.e. the first job step is our controlling step. The reason is simple, let's say we want an hourly job to NOT run for 3 hours after 3 continuous failures. Here is an example:

  • Say a job fails at 9:00 AM, 10:00 AM and 11:00 AM, according to the requirement, the job should not run for the next 3 hours, i.e. not run at 12:00 PM/1:00 PM /2:00 PM, but will run again at 3:00 PM. But when the job started at 11:00 AM, in job step 1, we cannot predict whether the job will fail or not, so we cannot do anything. As such, we need the job to still run at 12:00 PM and until then we can evaluate whether the job had failed 3 times already. If so, we simply quit the job in step 1 without running all the following job steps and on the other hand, we can reset the job schedule time to start at 3:00 PM (i.e. 12:00 PM + 3 hours). This way, the job's behavior is run at 12:00 PM for job step 1 only, and not run at 1:00 PM/2:00 PM and starts again at 3:00 PM.

Here is the implementation, we first create a table and populate it (assuming we are targeting a job called [test_1] which has two steps. The first step is our controlling step with configuration [On failure action]=[Quit the job reporting success] and the second step only has a "raiserror" statement to always cause the job failure. Job [test_1] is scheduled to run every minute between 12:00:00am and 11:59:59pm.

use msdb -- you can use other user db if needed
if object_id('dbo.JobData', 'U') is not null
 drop table dbo.JobData

create table dbo.JobData (
  JobName sysname 
, FailureNumber int default 3 -- how many continuous failures will occur before the delay action will kick in, default to 3
, DelayTime int default 0 -- how much time do you want to dealy? In HHMMSS format, so 20304 means dealy 2 hrs 3 min and 4 seconds
, IsDelayed bit -- 0 = not delayed, still using the original StartTime, 1=delayed, using the RevisedStartTime
, OriginalStartTime int -- original StartTime, in HHMMMSS format
, OriginalEndTime int -- original EndTime, in HHMMMSS format
, OriginalActiveDay int -- original Active Day in YYYYMMDD format
, RevisedStartTime int -- Revised Start Time
, RevisedStartDay int -- Revised Start Day

-- populate the table first
-- we assume the job has only one schedule or if the job has multiple schedules, we will only delay the job on one schedule
insert into dbo.JobData (JobName, IsDelayed, OriginalStartTime, OriginalEndTime,OriginalActiveDay, DelayTime)
select top 1 j.name, 0, s.active_start_time, s.active_end_time, s.Active_Start_Date 
, 500 --DelayTime to be 00:05:00 i.e.  5 min delay  Example only, choose your own delay interval
from msdb.dbo.sysjobs j
inner join msdb.dbo.sysjobschedules js
on j.job_id = js.job_id
inner join msdb.dbo.sysschedules s
on s.schedule_id = js.schedule_id
where j.name='test_1' 
order by s.Active_start_time -- change to your own job name

Now we have the following code for job step 1:

use msdb
set nocount on;

if not exists (select * from msdb.sys.tables where name='JobData') return;

declare @Failure_Number int; -- set your preferred N times before you want to take actions
declare @cnt int;
declare @subject varchar(200), @body varchar(512);

declare @OrigStartTime int, @OrigEndTime int, @OrigActiveDay int, @delaytime int, @isDelayed bit, @job_name varchar(128);
declare @currTime int, @NewActiveDay int, @NewActiveTime int ;
declare @day int, @hr int, @min int, @sec int;
declare @job_id uniqueidentifier, @schedule_id int;

set @job_id = convert(uniqueidentifier, $(ESCAPE_NONE(JOBID)));
set @NewActiveday = 0;
set @currTime = cast(replace(substring(convert(varchar, getdate(), 114),1,8), ':', '') as int)

if not exists ( select * from msdb.dbo.jobdata d
    inner join msdb.dbo.sysjobs j
    on j.name = d.jobname
    and j.job_id = @job_id)
 return; -- do nothing

select @job_name=JobName
, @OrigStartTime = d.OriginalStartTime
, @OrigEndTime = d.OriginalEndTime
, @OrigActiveDay = d.OriginalActiveDay
, @delaytime = d.delaytime
, @isDelayed = d.isDelayed
, @Failure_Number = d.FailureNumber
from msdb.dbo.jobdata d
inner join msdb.dbo.sysjobs j
on j.name = d.jobname
and j.job_id = @job_id;

-- we assume there is only schedule for the job
select top 1 @schedule_id = schedule_id
from msdb.dbo.sysjobschedules
where job_id = @job_id;

-- check job history if it has failed three times, we set isDelayed = IsDelayed xor IsDelayed

;with c as (
select top (@Failure_Number) instance_id, run_status
from msdb.dbo.sysjobhistory
where job_id [email protected]_id
and step_id=0
order by instance_id desc
select @cnt = count(*) 
from c
where run_status = 0 -- 0 = failure; 1=success

if @cnt = @Failure_Number 
  if @isDelayed=0
  -- start to reset the next start time to be current time + @delayedtime
  set @sec = @currTime%100 + @delayTime%100
  select @currTime /=100, @delayTime /=100
  set @min = @currTime%100 + @delayTime%100 + @sec/60;
  set @sec = @sec%60
  select @currTime /=100, @delayTime /=100
  set @hr =  @currTime%100 + @delayTime%100 + @min/60;
  set @min = @min%60
  set @day = @hr / 24;
  set @hr = @hr %24

  set @NewActiveTime = @hr*10000+ @min*100 + @sec;

  set @NewActiveDay = case when @day > 0 then cast(convert(char(8), getdate(), 112) as int) + @day
         else @OrigActiveDay end;

  if @NewActiveTime between @OrigStartTime and @OrigEndTime -- if @NewActiveTime is out of the pre-defined time window, we do nothing
   exec msdb.dbo.sp_update_schedule @schedule_id = @schedule_id
   , @active_start_time = @NewActiveTime
   , @active_start_date = @NewActiveDay;

  update d set IsDelayed = 1, RevisedStartTime = @NewActiveTime, RevisedStartDay = @NewActiveDay
  from msdb.dbo.JobData d
  where [email protected]_name;

  -- raiserror to quit the job
  set @body='Quit the job because it has failed more than ['+ convert(varchar,@Failure_Number)+'] times';
  raiserror (@body, 15, 1); -- this will set the run_status = 1 for step_id=0, when the job quits
                            -- since the first step job is configured to [On failure action]=[Quit the job reporting success]
          -- thus, this guarantees there is no more than [@failure_number] continuous failures.

 end --not reaching [@Failure_Number] times
 if @isDelayed=1
  update d set IsDelayed = 0, RevisedStartTime=0
  from msdb.dbo.JobData d
  where [email protected]_name;

  -- reverse the schedule to normal
  exec msdb.dbo.sp_update_schedule @schedule_id = @schedule_id
   , @active_start_time = @OrigStartTime
   , @active_start_date = @OrigActiveDay;

Job step 2 has the following code:

raiserror ('Intentional failure, pls ignore!', 15, 1);

In the job history, we will see the following facts:

  • After 3 continuous failures (on step 2), the fourth failure is on step 1, because step 1 detects the previous 3 failures and thus intentionally fires a raiserror to quit (reporting success as configured).
  • The next run will be five minutes late as defined.
  • There will be another 3 continuous failures on job step 2 before the next run fails on step 1 and then the next run will be 5 minutes later.
  • The pattern repeats itself.
SQL Server Agent Job History


The tip gives us various ways to handle some interesting requirements for SQL Server Agent Job management. As a DBA, I often encounter the issues during weekends or at mid-night when scheduled SQL Server Agent Jobs are failing due to network or hardware issues or other related 3rd party applications.  By default these job failures will send out emails, but I also know that in such scenarios, I really do not care about the failures and I actually want the jobs to stop for a while before running again (hoping the network will be restored by that time). If I ignore the notifications, I will end up next morning with tens of emails, which I need to delete anyway. So in such an environment, our DBA team agrees that we want the job to not start 3 hours after continuous failure of 3 times. This way, we will greatly reduce the emails received.

In the real world, other than this method, I also rely on creating a central monitoring job process to monitor all jobs of interest to implement our management rules (disable notifications, postpone executions, etc.).

Also please note if a schedule is shared by multiple jobs, you need to understand the consequences of postponing a job by changing the shared schedule.

My provided code assumes a job runs daily from 12:00 AM to 11:59:59 PM, but if a job schedule is from 8:00 AM to 8:00 PM (running hourly), we may need to add additional logic when the job continuously fails around the boundaries. For example, our rule is if the job continuously fails 3 times, we want the job to postpone its next run by 3 hours.  So the job fails at 4:00 PM/5:00 PM/6:00 PM, at the 7:00 PM run the job should postpone itself to start at 10:00 PM (i.e. 3 hours later), but 10:00 PM is outside of the scheduled 8:00 AM to 8:00 PM time window, so what should we do?

Next Steps

sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip

About the author
MSSQLTips author Jeffrey Yao Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.

View all my tips

Comments For This Article