Managing Hung Jobs in SQL Server Agent Job and Oracle Job Scheduler

By:   |   Updated: 2022-09-22   |   Comments   |   Related: More > Other Database Platforms


Problem

My company has multiple scheduled recurring activities in both SQL Server and Oracle and sometimes the jobs hang. How can I handle this so processing can continue with minimal interruptions?

Solution

Job hangs can happen for numerous reasons:

  • A deadlock occurs and doesn't get cleared automatically
  • The job finished but is reported as still running; maybe the agent or the server rebooted
  • A package is corrupted, causing it to fail intermittently
  • A local resource is temporarily crowded
  • An executable failed with a transient error
  • Network communication got interrupted
  • Remote resources are temporarily unavailable
  • DNS intermittent malfunction, maybe it was updated recently, or one replica is failing
  • Waiting for user input before closing
  • No free space left for writing the log
  • A command needs to finish graciously before starting the next command
  • Antivirus blocking the job at that moment
  • A remote database session got killed and was unable to be recovered/terminated locally
  • A remote database query failed due to any of the reasons above

When any of these occur, sometimes you can't handle it in the code (retry/rollback). If the job can be restarted, you can automate this task to continue processing information with minimal disruption. However, be careful to exclude long-running jobs (suspected corrupt data, waiting on resource locked by another session, big logfile, etc.) or when the failure must be fixed first (network configuration changes, corrupt database files, disk full, etc.), as that will adversely affect the performance of those jobs and will only increase the time they take to complete.

How do you automate this task in SQL Server 2019 and Oracle Windows 19c, and what are the differences?

Oracle

In Oracle Windows 19c, there is a standard way to create this setup:

Often, it's still not clear or straightforward even after reading the following documentation:

So, I'm going to show a more straightforward approach that is easier to understand and can be extended and adapted to your environment. This will be simpler than creating the job, a scheduler program (with its metadata argument EVENT_MESSAGE being thrown), and a stored procedure (which receives the SYS.SCHEDULER$_EVENT_INFO). And once you understand how it works, it will be much easier to create the scheduler program and stored procedure to improve performance as recommended by Oracle experts.

First, you need to subscribe to event-based notifications sent from the scheduler event queue. To do this, run the command below, logged in as a DBA user, taking note of the name assigned:

BEGIN
  DBMS_SCHEDULER.ADD_EVENT_QUEUE_SUBSCRIBER('DATA_EVENT');
END;
/

Note: Use the command below to remove the subscription:

BEGIN
  DBMS_SCHEDULER.REMOVE_EVENT_QUEUE_SUBSCRIBER('DATA_EVENT');
END;
/

To check which users are subscribed to event-based notifications from the scheduler event queue, run the following commands. The ones added are the last row in the first output and the last two rows in the second output.

SET LINESIZE 300
COL OWNER FOR A5
COL QUEUE_TABLE FOR A21
COL CONSUMER_NAME FOR A25
COL RULE FOR A35
SELECT OWNER, QUEUE_TABLE, CONSUMER_NAME, RULE
  FROM DBA_QUEUE_SUBSCRIBERS
 WHERE QUEUE_NAME = 'SCHEDULER$_EVENT_QUEUE';

OWNER QUEUE_TABLE           CONSUMER_NAME             RULE
----- --------------------- ------------------------- -----------------------------------
SYS   SCHEDULER$_EVENT_QTAB ILM_AGENT
SYS   SCHEDULER$_EVENT_QTAB SCHEDULER$_EVENT_AGENT
SYS   SCHEDULER$_EVENT_QTAB DATA_EVENT                tab.user_data.object_owner = 'SYS'
 

COL OBJECT_NAME FOR A30
SELECT OWNER, OBJECT_NAME, OBJECT_TYPE, STATUS
  FROM DBA_OBJECTS
 WHERE OBJECT_NAME LIKE '%SCHEDULER$_EVENT_QUEUE%';

OWNER OBJECT_NAME                    OBJECT_TYPE             STATUS
----- ------------------------------ ----------------------- -------
SYS   SCHEDULER$_EVENT_QUEUE_R       RULE SET                VALID
SYS   SCHEDULER$_EVENT_QUEUE_N       RULE SET                VALID
SYS   SCHEDULER$_EVENT_QUEUE         QUEUE                   VALID
SYS   SCHEDULER$_EVENT_QUEUE         UNDEFINED               VALID
SYS   SCHEDULER$_EVENT_QUEUE$21      RULE SET                VALID
SYS   SCHEDULER$_EVENT_QUEUE$21      RULE                    VALID
	

Now let's show the query to check which jobs have exceeded their max duration. It will only return records after you modify the existing jobs (to be explained later). The table DBA_SCHEDULER_JOBS has information about the existing jobs. We're interested in the last time the job ran and if it's running. The table SYS.AQ$SCHEDULER$_EVENT_QTAB means AQ for Advanced Queuing and QTAB for Queue Table. Note: SE.USER_DATA is an object, so it must be cast appropriately to get its nested columns EVENT_TIMESTAMP and EVENT_TYPE. We're interested at what time the job exceeded its max duration:

COL OWNER FOR A10
COL JOB_NAME FOR A10
COL EVENT_TIMESTAMP FOR A15
COL EVENT_TYPE FOR A16
COL LAST_START_DATE FOR A10
COL STATE FOR A10
SELECT J.OWNER, J.JOB_NAME,
       TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP EVENT_TIMESTAMP,
       TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TYPE EVENT_TYPE,
       J.LAST_START_DATE, J.STATE
  FROM DBA_SCHEDULER_JOBS J
 INNER JOIN SYS.AQ$SCHEDULER$_EVENT_QTAB SE ON
   TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).OBJECT_OWNER = J.OWNER AND
   TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).OBJECT_NAME = J.JOB_NAME
 WHERE TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP >= J.LAST_START_DATE
   AND TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TYPE = 'JOB_OVER_MAX_DUR'
   AND J.STATE = 'RUNNING'
 ORDER BY J.LAST_START_DATE;

With this query, we can create an event-based job that will stop other jobs that have exceeded their max duration. You can filter them as needed by modifying the query. The second step is to create the job with JOB_ACTION containing the previous query escaping single quotes and stopping the job with DBMS_SCHEDULER.STOP_JOB. Specifying this job will launch when the event type is JOB_OVER_MAX_DUR, specifying the queue type SYS.SCHEDULER$_EVENT_QUEUE and specifying the subscriber name created initially. Note: TAB.USER_DATA is a special collection and must be used exactly as that.

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name => 'STOP_HUNG_JOBS',
    job_type => 'PLSQL_BLOCK',
    job_action => '
BEGIN
  FOR i IN (
    SELECT DISTINCT OWNER||''.''||JOB_NAME JOB FROM (
      SELECT J.OWNER, J.JOB_NAME,
             TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP EVENT_TIMESTAMP,
             TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TYPE EVENT_TYPE,
             J.LAST_START_DATE, J.STATE
        FROM DBA_SCHEDULER_JOBS J
       INNER JOIN SYS.AQ$SCHEDULER$_EVENT_QTAB SE ON
       TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).OBJECT_OWNER = J.OWNER AND
       TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).OBJECT_NAME = J.JOB_NAME
       WHERE TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP >= J.LAST_START_DATE
         AND TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TYPE = ''JOB_OVER_MAX_DUR''
         AND J.STATE = ''RUNNING''
       ORDER BY J.LAST_START_DATE)) LOOP
    DBMS_SCHEDULER.STOP_JOB(i.JOB, TRUE);
  END LOOP;
END;
',
    event_condition => 'TAB.USER_DATA.EVENT_TYPE = ''JOB_OVER_MAX_DUR''',
    queue_spec => 'SYS.SCHEDULER$_EVENT_QUEUE, DATA_EVENT',
    enabled => TRUE);
END;
/

Logged in with the user MYDB, you can create a job with the query below to test it. It is set to wait for 120 seconds, to launch every minute, and to raise the event JOB_OVER_MAX_DUR if it has been running for more than 60 seconds, which is the smallest possible value.

BEGIN
  DBMS_SCHEDULER.CREATE_JOBS(
    JOB_DEFINITION_ARRAY(
      JOB_DEFINITION(
        job_name => '"MYDB"."TEST"',
        job_type => 'PLSQL_BLOCK',
        job_action => 'DBMS_LOCK.SLEEP(120);',
        repeat_interval => 'FREQUENCY=MINUTELY;INTERVAL=1;',
        max_run_duration => INTERVAL '60' SECOND,
        enabled => TRUE,
        number_of_arguments => 0)));
END;
/

To specify the MAX_RUN_DURATION for an existing job is as follows:

BEGIN
  DBMS_SCHEDULER.SET_ATTRIBUTE(
    name => '"MYDB"."TEST"',
    attribute => 'MAX_RUN_DURATION',
    value => NUMTODSINTERVAL(60, 'SECOND'));
END;
/

You can check the current active jobs and their configuration with the queries below:

SET LINESIZE 300
SET PAGESIZE 500
COL OWNER FOR A5
COL JOB_NAME FOR A22
COL STATE FOR A9
COL REPEAT_INTERVAL FOR A30
COL LAST_START_DATE FOR A16
COL LAST_RUN_DURATION FOR A16
COL NEXT_RUN_DATE FOR A16
COL START_DATE FOR A16
COL MAX_RUN_DURATION FOR A15
SELECT OWNER, JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE
  FROM DBA_SCHEDULER_JOBS
 WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS')
   AND ENABLED='TRUE';

OWNER JOB_NAME               STATE     LAST_START_DATE  LAST_RUN_DURATIO NEXT_RUN_DATE
----- ---------------------- --------- ---------------- ---------------- ----------------
SYS   STOP_HUNG_JOBS         SCHEDULED 30-AUG-22 09.07. +000000000 00:00
                                       26.017000 PM AME :00.297000
                                       RICA/MEXICO_CITY
 
MYDB  TEST                   RUNNING   30-AUG-22 09.07.                  30-AUG-22 09.07.
                                       26.314000 PM AME                  04.205000 PM AME
                                       RICA/MEXICO_CITY                  RICA/MEXICO_CITY
 


SELECT OWNER, JOB_NAME, START_DATE, REPEAT_INTERVAL, ENABLED, MAX_RUN_DURATION
  FROM DBA_SCHEDULER_JOBS
 WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS')
   AND ENABLED='TRUE';

OWNER JOB_NAME               START_DATE       REPEAT_INTERVAL                ENABL MAX_RUN_DURATIO
----- ---------------------- ---------------- ------------------------------ ----- ---------------
SYS   STOP_HUNG_JOBS                                                         TRUE
MYDB  TEST                   30-AUG-22 09.02. FREQUENCY=MINUTELY;INTERVAL=1; TRUE  +000 00:01:00
                             04.283000 PM AME
                             RICA/MEXICO_CITY

You can see what happens in each schedule with the query and output below:

COL LOG_DATE FOR A10
COL OWNER FOR A5
COL JOB_NAME FOR A22
COL STATUS FOR A10
COL REQ_START_DATE FOR A10
COL ACTUAL_START_DATE FOR A10
COL RUN_DURATION FOR A13
COL ADDITIONAL_INFO FOR A10
SET LINESIZE 300
SET PAGESIZE 500
SELECT LOG_DATE, OWNER, JOB_NAME, STATUS, ACTUAL_START_DATE, RUN_DURATION, ADDITIONAL_INFO
  FROM DBA_SCHEDULER_JOB_RUN_DETAILS
 WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS')
 ORDER BY ACTUAL_START_DATE;

LOG_DATE   OWNER JOB_NAME               STATUS     ACTUAL_STA RUN_DURATION  ADDITIONAL
---------- ----- ---------------------- ---------- ---------- ------------  ----------
30-AUG-22  MYDB  TEST                   STOPPED    30-AUG-22  +000 00:01:18 REASON="St
08.33.37.0                                         09.32.18.3               op job wit
17000 PM -                                         61000 PM A               h force ca
06:00                                              MERICA/MEX               lled by us
                                                   ICO_CITY                 er: 'SYS'"
 
30-AUG-22  SYS   STOP_HUNG_JOBS         SUCCEEDED  30-AUG-22  +000 00:00:01
08.33.37.0                                         09.33.35.8
17000 PM -                                         77000 PM A
06:00                                              MERICA/MEX
                                                   ICO_CITY
 
30-AUG-22  MYDB  TEST                   STOPPED    30-AUG-22  +000 00:01:18 REASON="St
08.34.55.6                                         09.33.37.0               op job wit
89000 PM -                                         17000 PM A               h force ca
06:00                                              MERICA/MEX               lled by us
                                                   ICO_CITY                 er: 'SYS'"
 
30-AUG-22  SYS   STOP_HUNG_JOBS         SUCCEEDED  30-AUG-22  +000 00:00:01
08.34.55.6                                         09.34.54.8
89000 PM -                                         92000 PM A
06:00                                              MERICA/MEX
                                                   ICO_CITY

By creating this setup, you have automated the hung job stop functionality, and in the next job schedule, it can continue processing data without noticing and requiring manual intervention.

Additionally, to clear the scheduler job run details for any reason, you can do it with the following:

DELETE FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS');

To clear the queue events, use the following:

DECLARE
  PO DBMS_AQADM.AQ$_PURGE_OPTIONS_T;
BEGIN
  PO.BLOCK := FALSE;
  DBMS_AQADM.PURGE_QUEUE_TABLE(
    QUEUE_TABLE => 'SCHEDULER$_EVENT_QTAB',
    PURGE_CONDITION => NULL,
    PURGE_OPTIONS => PO);
END;
/

Once you've confirmed the stop job works successfully, you can drop the test job:

BEGIN
  DBMS_SCHEDULER.STOP_JOB('MYDB.TEST');
  DBMS_SCHEDULER.DISABLE('MYDB.TEST');
  DBMS_SCHEDULER.DROP_JOB('MYDB.TEST', TRUE);
END;
/

The first command stops the job if it's running, the second disables the job from running in the next schedule, and the third deletes the job. Now you need to set the MAX_RUN_DURATION for the existing jobs you want to be stopped, and you're done.

SQL Server

No job property indicates the max time it should run, nor a way to raise an event when it has exceeded a certain amount of time. However, in the schedules, you can specify an active end time, and one job can be associated with multiple schedules, so this can be used to create a job that stops hung jobs.

Let's show the query to check which jobs have exceeded the schedule active end time, but it will only return records after you specify this property in a schedule:

SELECT js.schedule_id, j.name job_name, msdb.dbo.agent_datetime(CONVERT(VARCHAR, ja.start_execution_date, 112), s.active_end_time) MustFinishBy
 FROM msdb.dbo.sysjobs j
 INNER JOIN msdb.dbo.sysjobschedules js ON js.job_id = j.job_id
 INNER JOIN msdb.dbo.sysschedules s ON s.schedule_id = js.schedule_id
 INNER JOIN msdb.dbo.sysjobactivity ja ON ja.job_id = j.job_id
 WHERE ja.start_execution_date IS NOT NULL
   AND s.active_end_time IS NOT NULL

With this query, we can create a job that will stop other jobs that have exceeded the scheduled active end time; you can filter them as needed by modifying the query. The steps below create the job, associate it with the local server, add a step to the job, add a schedule to run it daily (3rd parameter) every (5th parameter) minute (4th parameter), and attach the schedule to the job:

EXEC msdb.dbo.sp_add_job 'STOP_HUNG_JOBS', 1
EXEC msdb.dbo.sp_add_jobserver @job_name='STOP_HUNG_JOBS', @server_name = @@SERVERNAME
EXEC msdb.dbo.sp_add_jobstep @job_name='STOP_HUNG_JOBS', @step_id=1, @step_name='STOP_HUNG_JOBS', @command='
DECLARE @cmd VARCHAR(MAX);
SET @cmd = '''';
SELECT @cmd=@cmd+''EXEC msdb.dbo.sp_stop_job ''''''+job_name+'''''';'' FROM (
SELECT DISTINCT job_name FROM (
SELECT js.schedule_id, j.name job_name, msdb.dbo.agent_datetime(CONVERT(VARCHAR, ja.start_execution_date, 112), s.active_end_time) MustFinishBy
  FROM msdb.dbo.sysjobs j
 INNER JOIN msdb.dbo.sysjobschedules js ON js.job_id = j.job_id
 INNER JOIN msdb.dbo.sysschedules s ON s.schedule_id = js.schedule_id
 INNER JOIN msdb.dbo.sysjobactivity ja ON ja.job_id = j.job_id
 WHERE ja.start_execution_date IS NOT NULL
   AND s.active_end_time IS NOT NULL) A
 WHERE GETDATE() > A.MustFinishBy) B;
EXEC (@cmd);
'
EXEC msdb.dbo.sp_add_schedule 'MinutelyEvery1Minute', 1, 4, 1, 0x4, 1
EXEC msdb.dbo.sp_attach_schedule @job_name='STOP_HUNG_JOBS', @schedule_name='MinutelyEvery1Minute'

To test this job, you can create schedules per minute programmatically with the query below. They're created to run every minute and set to end the minute after it starts:

DECLARE
  @cnt INT = 0,
  @min VARCHAR(2),
  @nextmin VARCHAR(2),
  @hr VARCHAR(2),
  @nexthr VARCHAR(2);
WHILE @cnt < 1440
BEGIN
  SET @min = RIGHT('0'+CAST(@cnt%60 AS VARCHAR(2)), 2)
  SET @nextmin = RIGHT('0'+CAST((@cnt+1)%60 AS VARCHAR(2)), 2)
  SET @hr = RIGHT('0'+CAST(@cnt/60 AS VARCHAR(2)), 2)
  SET @nexthr = CASE WHEN @nextmin='00' THEN RIGHT('0'+CAST((@cnt/60)+1 AS VARCHAR(2)), 2) ELSE @hr END
  SET @nexthr = CASE WHEN @nexthr='24' THEN '00' ELSE @nexthr END
  EXEC ('EXEC msdb.dbo.sp_add_schedule '''+@hr+':'+@min+''', 1, 4, 1, @active_start_time='''+@hr+@min+'00'', @active_end_time='''+@nexthr+@nextmin+'00''')
  SET @cnt = @cnt + 1;
END;

If you already have a schedule, you can update it with the query below, 000100 means the first minute after midnight:

EXEC msdb.dbo.sp_update_schedule @name='00:00', @active_end_time = 000100;

You can check the existing schedules and their active end time with the query below:

SELECT name, active_start_date, active_start_time, active_end_date, active_end_time
  FROM msdb.dbo.sysschedules

Note: The results look strange:

sysschedules first part
sysschedules second part

But, they display correctly when you view the schedules:

manage schedules
job schedule properties 00:00
job schedule properties 23:59

Also, note that the last schedule, even when we set it to 00:00:00, changed to 11:59:59 PM.

Now you can create the test job as follows:

EXEC msdb.dbo.sp_add_job 'test', 1
EXEC msdb.dbo.sp_add_jobserver @job_name='test', @server_name = @@SERVERNAME
EXEC msdb.dbo.sp_add_jobstep @job_name='test', @step_id=1, @step_name='step1', @command='WAITFOR DELAY ''00:02:00'';'

And you can associate the job with the created schedules programmatically:

DECLARE
  @cnt INT = 0,
  @min VARCHAR(2),
  @hr VARCHAR(2);
WHILE @cnt < 1440
BEGIN
  SET @min = RIGHT('0'+CAST(@cnt%60 AS VARCHAR(2)), 2)
  SET @hr = RIGHT('0'+CAST(@cnt/60 AS VARCHAR(2)), 2)
  EXEC ('EXEC msdb.dbo.sp_attach_schedule @job_name=''test'', @schedule_name='''+@hr+':'+@min+'''')
  SET @cnt = @cnt + 1;
END;

You can see what happens in each schedule as described in this tip: Script for SQL Server Agent Job Issues Across All Instances.

In my case, the output of all jobs history is below:

output of all jobs history

By creating this setup, you have automated the hung job stop functionality. In the next job schedule, it can continue processing data without noticing or requiring manual intervention.

Once you've confirmed the stop job works successfully, you can drop the test job, which also deletes the per minute schedules created earlier and its job history:

EXEC msdb.dbo.sp_delete_job @job_name='test'

Now you just need to set the active_end_time for the existing schedules you want to be stopped, and you're done.

Next Steps

You can learn more about SQL Server scheduled jobs in the links below:




Get Started Now - Click here to get your free 14 day trial of SolarWinds Database Insights




get scripts

next tip button



About the author
MSSQLTips author Pablo Echeverria Pablo Echeverria worked for more than 10 years as a software programmer and analyst. In 2016, I switched jobs to a DBA position, where I have implemented new processes, created better monitoring tools and grown my data scientist skills.

View all my tips


Article Last Updated: 2022-09-22

Comments For This Article

















get free sql tips
agree to terms