SQL Server Agent Job Advanced Management
When we set up a SQL Server Agent Job, from time to time there may be some special requirements on how jobs should behave. How can we support these special instances in SQL Server?
- Auto Delete - We wish to delete the job automatically after a specific time, i.e. one week later.
- Failure Notifications - If the job continuously fails and if it is configured to send failure notification, we want the notifications to stop after [N] times.
- Auto Disable - If the job continuously fails for [N] times, we wish the job to be auto-disabled (and send an email to notify the DBA of this fact).
- Scheduling - If the job has a complex schedule, we want to avoid setting up multiple schedules, such as on Mon/Tue/Wed, the job will run hourly between 8:00 AM to 5:00 PM and 8:00 PM to 2:00 AM, but on Thu/Fri, the job will run hourly between 9:00 AM to 2:00 PM and 7:00 PM to 12:00 AM, while on Sat/Sun, it will run hourly between 10:00 AM to 3:00 PM and 10:00 PM to 3:00 AM.
- Delayed Execution - If a job continuously fails for [N] times, we want the job to not run in the next [X] hours before running again.
With the help of SQL Server Agent tokens, we can come up with corresponding solutions to each requirement. The generic approach is to add one header step as the first job step that processes the various requirements then makes the necessary reconfiguration to the SQL Server Agent Job dynamically. Now will take a look at each of the above-mentioned requirements.
Auto Delete SQL Server Agent Jobs
The first requirement is to delete the job after a specific date. As you may know, in a SQL Server Agent Job there is an option to delete the job as shown below:
So to address the job deletion requirement, all we need to do is add a header job step and put the following code in the step.
The logic is very straight forward. We will check whether the current time is past a specified time, if so, we will set the self deletion configuration and then raise an error to fail the step, which has the default [On failure action] setting (i.e. Quit the job reporting failure) as shown below:
declare @job_id uniqueidentifier; set @job_id=CONVERT(uniqueidentifier, $(ESCAPE_NONE(JOBID))); -- token is used if getdate() > '2016-07-15 17:00:00' -- change to your preferred end time begin EXEC msdb.dbo.sp_update_job @job_id=@job_id, @delete_level=2; -- set auto deletion raiserror ('self deletion', 16, 1); -- cause the step failure end
Failure Notifications and Auto Disable SQL Server Agent Jobs
The second and third requirements are basically the same from a coding perspective. The basic design is to check the msdb.dbo.sysjobhistory table and if we find [N] continuous failures, we will take the following actions:
- Disable the notification (or disable the job).
- Send out an email to notify the DBA team about the disable action.
- If a next run of the job succeeds, we should reset the notification; i.e. enabling the notification until there are another N continuous failures.
Of course, if you disable the job, the job will never run until enabled outside of the job (because a disabled job cannot enable itself).
set nocount on; declare @Failure_Number int =3; -- set your preferred N times before you want to take actions declare @cnt int; declare @subject varchar(200), @body varchar(512), @job_name varchar(128); declare @myemail varchar(300) = '[email protected]'; -- replace with your own email account declare @JOB_ID uniqueidentifier; set @job_id = convert(uniqueidentifier, $(ESCAPE_NONE(JOBID))); select @job_name = name from msdb.dbo.sysjobs where job_id = @job_id; ; with c as ( select top (@Failure_Number) instance_id, run_status from msdb.dbo.sysjobhistory where job_id =@job_id and step_id=0 order by instance_id desc ) select @cnt = count(*) from c where run_status = 0 if (@cnt=@Failure_Number) begin -- exec msdb.dbo.sp_update_job @job_id = @job_id, @enabled=0; -- un-comment this line if you want to disable the job if (select notify_email_operator_id from msdb.dbo.sysjobs where JOB_ID=@job_id) > 0 begin EXEC msdb.dbo.sp_update_job @job_ID=@job_id, @notify_email_operator_name=N''; set @subject = 'Job [' + @job_name +'] failure notification is disabled'; set @body = 'this job''s failure notification is curently disabled after ' + cast(@failure_number as varchar(10)) + ' continuous failures;' exec msdb.dbo.sp_send_dbmail @recipients=@myemail, @subject=@subject, @body = @body; end end else -- if there is a success after [@Failure_Number] failures, we will reenable the notification begin if (select notify_email_operator_id from msdb.dbo.sysjobs where JOB_ID=@job_id) = 0 EXEC msdb.dbo.sp_update_job @job_ID=@job_id, @notify_level_email=2, @notify_email_operator_name=N'DBA_Test'; -- replace with your own OPERATOR name here end
SQL Server Agent Scheduling
For the fourth requirement, a possible solution is to have an if condition to see whether the current time is within the specified time window, if so, exit the current step and continue to the next step, otherwise, raise an error to quit the job.
To prevent the job logging the failure status on step 1 (as we purposely fail this step just for quitting the job), we can configure the step as [On failure action]=[Quit the job reporting success] as shown below:
-- I am hard-coding the date time here for simplicity, but a better way is to create a table to hold all these values for flexibility declare @dt datetime = getdate(); declare @time varchar(10) = convert(varchar(10), @dt, 108); if ( (( @time between '08:00:00' and '17:00:00') or ( @time between '20:00:00' and '23:59:59') or ( @time between '00:00:00' and '02:00:00')) and ( datename(dw, @dt) in ('Monday','Tuesday', 'Wednesday')) ) -- Mon/Tue/Wed, between 8:00am to 5:00pm, 8:00pm to 2:00am or ( (( @time between '09:00:00' and '14:00:00') or ( @time between '19:00:00' and '23:59:59')) and ( datename(dw, @dt) in ('Thursday','Friday')) ) -- Thur/Fri, between 9:00am to 2:00pm, 7:00pm to 12:00am or ( (( @time between '10:00:00' and '15:00:00') or ( @time between '22:00:00' and '23:59:59') or ( @time between '00:00:00' and '03:00:00')) and ( datename(dw, @dt) in ('Thursday','Friday')) ) -- Sat/Sun, between 10:00am to 3:00pm, 10:00pm to 3:00am begin return; end else -- out of allowed windows begin raiserror ('Not in the allowed window', 15, 1); end
If we prefer not hard-coding values, we can use the following sample code by putting values into a table:
/* use msdb; -- you can choose your own user database create table msdb.dbo.Schedule (StartTime char(8), Endtime char(8), [WeekDay] varchar(12)) -- populate it with needed values insert into msdb.dbo.Schedule (StartTime, EndTime, [WeekDay]) values ('08:00:00', '17:00:00', 'Monday'), ('20:00:00', '23:59:59', 'Monday'), ('00:00:00', '02:00:00', 'Monday') , ('08:00:00', '17:00:00', 'Tuesday'), ('20:00:00', '23:59:59', 'Tuesday'), ('00:00:00', '02:00:00', 'Tuesday') , ('08:00:00', '17:00:00', 'Wednesday'), ('20:00:00', '23:59:59', 'Wednesday'), ('00:00:00', '02:00:00', 'Wednesday') , ('09:00:00', '14:00:00', 'Thursday'), ('19:00:00', '23:59:59', 'Thursday') , ('09:00:00', '14:00:00', 'Friday'), ('19:00:00', '23:59:59', 'Friday') , ('10:00:00', '15:00:00', 'Saturday'), ('22:00:00', '23:59:59', 'Saturday'), ('00:00:00', '03:00:00', 'Saturday') , ('10:00:00', '15:00:00', 'Sunday'), ('22:00:00', '23:59:59', 'Saturday'), ('00:00:00', '03:00:00', 'Sunday') */ declare @dt datetime = getdate(); declare @time varchar(10) = convert(varchar(10), @dt, 108); if exists ( select * from msdb.dbo.Schedule where (@time between StartTime and EndTime) and datename(dw, @dt) = [WeekDay] ) return -- move on to next job step else raiserror ('Not in the allowed window', 15, 1);
SQL Server Agent Job Delayed Execution
The fifth requirement is relatively more complex in coding. What we do here is to update the job's associated schedules, so the next start time is [X] time later.
To simplify the work, we assume the job has only one schedule and it is running daily every [x] intervals, where "interval" can be hour/minute/second.
To achieve this work, we will use an external table to avoid hard-coding.
The basic business requirement is:
- If a job fails continuously [X] times, then we want the job to not run for the next [Y] interval, where interval can be hour/minute/second.
- After [Y] interval, we will reset the conditions, and from now on if the job fails continuously [X] times, another [Y] interval will be implemented, and so on so forth.
The requirement is challenging to implement with our current architecture, i.e. the first job step is our controlling step. The reason is simple, let's say we want an hourly job to NOT run for 3 hours after 3 continuous failures. Here is an example:
- Say a job fails at 9:00 AM, 10:00 AM and 11:00 AM, according to the requirement, the job should not run for the next 3 hours, i.e. not run at 12:00 PM/1:00 PM /2:00 PM, but will run again at 3:00 PM. But when the job started at 11:00 AM, in job step 1, we cannot predict whether the job will fail or not, so we cannot do anything. As such, we need the job to still run at 12:00 PM and until then we can evaluate whether the job had failed 3 times already. If so, we simply quit the job in step 1 without running all the following job steps and on the other hand, we can reset the job schedule time to start at 3:00 PM (i.e. 12:00 PM + 3 hours). This way, the job's behavior is run at 12:00 PM for job step 1 only, and not run at 1:00 PM/2:00 PM and starts again at 3:00 PM.
Here is the implementation, we first create a table and populate it (assuming we are targeting a job called [test_1] which has two steps. The first step is our controlling step with configuration [On failure action]=[Quit the job reporting success] and the second step only has a "raiserror" statement to always cause the job failure. Job [test_1] is scheduled to run every minute between 12:00:00am and 11:59:59pm.
use msdb -- you can use other user db if needed if object_id('dbo.JobData', 'U') is not null drop table dbo.JobData create table dbo.JobData ( JobName sysname , FailureNumber int default 3 -- how many continuous failures will occur before the delay action will kick in, default to 3 , DelayTime int default 0 -- how much time do you want to dealy? In HHMMSS format, so 20304 means dealy 2 hrs 3 min and 4 seconds , IsDelayed bit -- 0 = not delayed, still using the original StartTime, 1=delayed, using the RevisedStartTime , OriginalStartTime int -- original StartTime, in HHMMMSS format , OriginalEndTime int -- original EndTime, in HHMMMSS format , OriginalActiveDay int -- original Active Day in YYYYMMDD format , RevisedStartTime int -- Revised Start Time , RevisedStartDay int -- Revised Start Day ) go -- populate the table first -- we assume the job has only one schedule or if the job has multiple schedules, we will only delay the job on one schedule insert into dbo.JobData (JobName, IsDelayed, OriginalStartTime, OriginalEndTime,OriginalActiveDay, DelayTime) select top 1 j.name, 0, s.active_start_time, s.active_end_time, s.Active_Start_Date , 500 --DelayTime to be 00:05:00 i.e. 5 min delay Example only, choose your own delay interval from msdb.dbo.sysjobs j inner join msdb.dbo.sysjobschedules js on j.job_id = js.job_id inner join msdb.dbo.sysschedules s on s.schedule_id = js.schedule_id where j.name='test_1' order by s.Active_start_time -- change to your own job name
Now we have the following code for job step 1:
use msdb set nocount on; if not exists (select * from msdb.sys.tables where name='JobData') return; declare @Failure_Number int; -- set your preferred N times before you want to take actions declare @cnt int; declare @subject varchar(200), @body varchar(512); declare @OrigStartTime int, @OrigEndTime int, @OrigActiveDay int, @delaytime int, @isDelayed bit, @job_name varchar(128); declare @currTime int, @NewActiveDay int, @NewActiveTime int ; declare @day int, @hr int, @min int, @sec int; declare @job_id uniqueidentifier, @schedule_id int; set @job_id = convert(uniqueidentifier, $(ESCAPE_NONE(JOBID))); set @NewActiveday = 0; set @currTime = cast(replace(substring(convert(varchar, getdate(), 114),1,8), ':', '') as int) if not exists ( select * from msdb.dbo.jobdata d inner join msdb.dbo.sysjobs j on j.name = d.jobname and j.job_id = @job_id) return; -- do nothing select @job_name=JobName , @OrigStartTime = d.OriginalStartTime , @OrigEndTime = d.OriginalEndTime , @OrigActiveDay = d.OriginalActiveDay , @delaytime = d.delaytime , @isDelayed = d.isDelayed , @Failure_Number = d.FailureNumber from msdb.dbo.jobdata d inner join msdb.dbo.sysjobs j on j.name = d.jobname and j.job_id = @job_id; -- we assume there is only schedule for the job select top 1 @schedule_id = schedule_id from msdb.dbo.sysjobschedules where job_id = @job_id; -- check job history if it has failed three times, we set isDelayed = IsDelayed xor IsDelayed ;with c as ( select top (@Failure_Number) instance_id, run_status from msdb.dbo.sysjobhistory where job_id =@job_id and step_id=0 order by instance_id desc ) select @cnt = count(*) from c where run_status = 0 -- 0 = failure; 1=success if @cnt = @Failure_Number begin if @isDelayed=0 begin -- start to reset the next start time to be current time + @delayedtime set @sec = @currTime%100 + @delayTime%100 select @currTime /=100, @delayTime /=100 set @min = @currTime%100 + @delayTime%100 + @sec/60; set @sec = @sec%60 select @currTime /=100, @delayTime /=100 set @hr = @currTime%100 + @delayTime%100 + @min/60; set @min = @min%60 set @day = @hr / 24; set @hr = @hr %24 set @NewActiveTime = @hr*10000+ @min*100 + @sec; set @NewActiveDay = case when @day > 0 then cast(convert(char(8), getdate(), 112) as int) + @day else @OrigActiveDay end; if @NewActiveTime between @OrigStartTime and @OrigEndTime -- if @NewActiveTime is out of the pre-defined time window, we do nothing exec msdb.dbo.sp_update_schedule @schedule_id = @schedule_id , @active_start_time = @NewActiveTime , @active_start_date = @NewActiveDay; update d set IsDelayed = 1, RevisedStartTime = @NewActiveTime, RevisedStartDay = @NewActiveDay from msdb.dbo.JobData d where JobName=@job_name; -- raiserror to quit the job set @body='Quit the job because it has failed more than ['+ convert(varchar,@Failure_Number)+'] times'; raiserror (@body, 15, 1); -- this will set the run_status = 1 for step_id=0, when the job quits -- since the first step job is configured to [On failure action]=[Quit the job reporting success] -- thus, this guarantees there is no more than [@failure_number] continuous failures. end end --not reaching [@Failure_Number] times else if @isDelayed=1 begin update d set IsDelayed = 0, RevisedStartTime=0 from msdb.dbo.JobData d where JobName=@job_name; -- reverse the schedule to normal exec msdb.dbo.sp_update_schedule @schedule_id = @schedule_id , @active_start_time = @OrigStartTime , @active_start_date = @OrigActiveDay; end
Job step 2 has the following code:
raiserror ('Intentional failure, pls ignore!', 15, 1);
In the job history, we will see the following facts:
- After 3 continuous failures (on step 2), the fourth failure is on step 1, because step 1 detects the previous 3 failures and thus intentionally fires a raiserror to quit (reporting success as configured).
- The next run will be five minutes late as defined.
- There will be another 3 continuous failures on job step 2 before the next run fails on step 1 and then the next run will be 5 minutes later.
- The pattern repeats itself.
The tip gives us various ways to handle some interesting requirements for SQL Server Agent Job management. As a DBA, I often encounter the issues during weekends or at mid-night when scheduled SQL Server Agent Jobs are failing due to network or hardware issues or other related 3rd party applications. By default these job failures will send out emails, but I also know that in such scenarios, I really do not care about the failures and I actually want the jobs to stop for a while before running again (hoping the network will be restored by that time). If I ignore the notifications, I will end up next morning with tens of emails, which I need to delete anyway. So in such an environment, our DBA team agrees that we want the job to not start 3 hours after continuous failure of 3 times. This way, we will greatly reduce the emails received.
In the real world, other than this method, I also rely on creating a central monitoring job process to monitor all jobs of interest to implement our management rules (disable notifications, postpone executions, etc.).
Also please note if a schedule is shared by multiple jobs, you need to understand the consequences of postponing a job by changing the shared schedule.
My provided code assumes a job runs daily from 12:00 AM to 11:59:59 PM, but if a job schedule is from 8:00 AM to 8:00 PM (running hourly), we may need to add additional logic when the job continuously fails around the boundaries. For example, our rule is if the job continuously fails 3 times, we want the job to postpone its next run by 3 hours. So the job fails at 4:00 PM/5:00 PM/6:00 PM, at the 7:00 PM run the job should postpone itself to start at 10:00 PM (i.e. 3 hours later), but 10:00 PM is outside of the scheduled 8:00 AM to 8:00 PM time window, so what should we do?
- You may enhance/customize the above methods to meet your own business requirements. For example, for jobs belonging to a specific category, we want to apply an administration rule that if any job fails more than 3 times continuously, we will wait [X] hours before re-running. Or for jobs that have failed [Y] times in the last [X] hours, an email notification will be sent, etc.
- You may take a look at the following tips to learn more about SQL Server Agent Jobs and their schedules:
About the author
View all my tips