SQL Server Disaster Recovery Planning and Testing



Problem

One old adage that has stood the test of time is "you will never know until you try". I would have to say that is the case with a disaster recovery plan. I would be surprised to hear that any company relying on IT does not have a disaster recovery plan. Some are probably more formal than others, but in the middle of many of those plans are SQL Servers and the need to recover them quickly to maintain business operations. But have you put your plan to the test? What about when key people are out of town? Or when you do not have direct access to your SQL Servers or network? Does your DR plan fall into place or does it fall like a house of cards?

Solution

Your disaster recovery plan is only as good as the people who built the processes and implemented the technology. Many disaster recovery plans are compromises with the business to support business operations at a reduced cost. That is fine as long as the correct expectations are set between IT and the business. When an external customer is involved and contractual agreements are in place, those agreements and the associated risk need to be clearly understood in order to put the appropriate people, processes and technology into place.

Key Questions

The first step in testing your disaster recovery plan is to ask yourself and your team some pointed questions and respond in simple terms, maybe even with one-word answers. How you and your team answer these questions dictates how the fire drill can be conducted.

  • SLA - What is your Service Level Agreement (SLA) with the business or your customers?
  • Cost - What is the tangible cost of downtime for your users, application, business, etc.?
  • Prevention - What type of disaster are you trying to prevent?
  • Recovery - What type of disaster are you trying to recover from?
  • Time - How much time and money do you have to spend on building, testing and maintaining the disaster recovery plan on a daily, weekly, monthly or quarterly basis?
  • Priority - Where does the disaster recovery plan fit into the organizational, departmental and personal priority list?
  • Need - What are the true disaster recovery risks for your organization based on where you are located, your line of business, your affiliations, etc.?
  • Responsibility - What are your disaster recovery responsibilities and why do you have those responsibilities? 
  • Plan - Do you have a documented disaster recovery plan?
  • Testing - Have you tested the disaster recovery plan?
  • Documentation - Do you update your documentation as dependent objects change? Or monthly? Or quarterly?
  • People - Will you have people to support the business?
  • Power - Will you have power to support the business? 
  • Access - Will you be able to get to the office or make a remote connection?
  • Process - Will you have a process that everyone on the team can follow?
  • Technology - Have you invested in technology that will improve the prevention or recovery from a disaster?
  • Dependencies - Are you and your team dependent on other teams or external entities in order for your applications to operate properly?  Do those teams test their plans on a regular basis?
  • Mitigation - Have you put multiple lines of defense in place to prevent a disaster as well as to recover from one?
  • Limits - How long can your business run without the IT infrastructure and applications?
  • Alternatives - Can the business run on paper-based operations for a finite period of time?
  • Experience - If you have an entire backup site/infrastructure, have you failed over to it during a low usage period such as a Saturday or Sunday, so the business really knows how the applications will perform even with a limited number of users?
  • Impacts - If you have an extended downtime, what will that do to your business in terms of customer loyalty, industry reputation, cash flow, etc.?
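
The Cost, SLA and Limits questions above come down to simple arithmetic once you have numbers from the business. The following is a minimal sketch, not a prescribed method: the class name, field names and dollar figures are all hypothetical and exist only to show the shape of the calculation. It covers tangible cost only and ignores the reputation and loyalty impacts raised under Impacts.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    """Hypothetical inputs for a rough, tangible-cost-only estimate."""
    revenue_per_hour: float     # assumed revenue lost per hour of outage
    staff_cost_per_hour: float  # assumed cost of idle or diverted staff per hour
    outage_hours: float         # expected or measured length of the outage

    def tangible_cost(self) -> float:
        """Direct cost of the outage: hourly losses times outage length."""
        return (self.revenue_per_hour + self.staff_cost_per_hour) * self.outage_hours


def exceeds_sla(outage_hours: float, sla_max_hours: float) -> bool:
    """True when the outage lasts longer than the SLA allows."""
    return outage_hours > sla_max_hours


if __name__ == "__main__":
    # Illustrative figures only; substitute numbers agreed with the business.
    estimate = DowntimeEstimate(
        revenue_per_hour=10_000.0,
        staff_cost_per_hour=2_000.0,
        outage_hours=4.0,
    )
    print(f"Tangible cost: {estimate.tangible_cost():,.2f}")
    print(f"SLA breached: {exceeds_sla(estimate.outage_hours, sla_max_hours=2.0)}")
```

Plugging in the outage length measured during a fire drill, rather than the length assumed in the plan, shows whether the plan actually meets the SLA.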

The Fire Drill

Based on the SLAs and disaster recovery plans you have in place, the fire drill is where the proverbial rubber meets the road. As such, it is necessary to test each of the key components of your disaster recovery plan:

  • Facilities
    • Do you have facilities for your team to work in if a disaster strikes? If these are different facilities, does everyone know how to get there? If needed, can your team members work from home?
    • Try working from these facilities and determine where you are productive and where you are not.
    • Validate that the facilities have the computers and software needed. If not, does your team use notebooks that would satisfy this need? Burn copies of the media and keep the license keys with them so they can be used as needed.
    • Does public transportation support your facilities?  If your work force primarily uses public transportation, will they be able to get to the backup facility?
  • Power
    • If you have a generator, do you run on it every week at a regularly scheduled time to validate that it works? Has all of the maintenance been performed on the generator? If a widespread disaster strikes, will you be able to get fuel to power the generator? Do you have an SLA with a fuel delivery company?
    • If you have multiple office locations, are they on the same grid and will they be impacted by a widespread power outage?
    • Are only your servers on the generators, and not user machines?
    • If you are just using battery backup, how much run time will the batteries provide? Can you shut down lower priority servers (e.g. test and development)?
  • People
    • Test how quickly your team can get to the office if a disaster strikes.
    • Do you have team members who will already be onsite if a disaster occurs, such as second and third shift, to lessen the burden of having your entire team flock to the office?
    • Do you have scheduled shifts for your team members to support the organization and balance personal needs?
  • Network
    • Will your network be able to route traffic between the needed facilities?
    • Do you have redundant long distance carriers to support the external interfaces to your organization and applications?
    • Can you drop particular segments of your network and remain operational? Do you have traffic that hops between offices and impacts other facilities?
  • Process
    • Consider scheduling a weekend to test these processes and work the kinks out, especially if the disaster recovery process has not been tested or has changed dramatically since the last test. Analyze, document and improve the process in preparation for an unscheduled fire drill.
    • Once the kinks have been worked out of the plan, schedule a real fire drill that is unannounced to your team members. Compare how this fire drill differed from the scheduled test in terms of downtime, memory of processes and leadership to complete the process.
  • Technology
    • Do you fail over the nodes of your clusters or your log shipping machines to validate that the applications will work properly? Do other infrastructure components need to change as well when your SQL Servers change?
    • Do you have the latest set of passwords in order to access the systems remotely?
    • Are patch levels consistent across your servers, or was your production server patched recently while your backup servers are 9 months behind?
    • Test how quickly you can really get your backup tapes from an off-site location, then consider how long it would take if all of the organizations in the area are requesting tapes, roads are backed up and the software is out of sync when you need to merge a tape.
    • If the Internet is inaccessible, how are you going to research and troubleshoot an issue?
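
The patch-level question under Technology is one of the easier checks to automate ahead of a fire drill. The sketch below is only an illustration under stated assumptions: the server names are made up, the version strings are sample values in the four-part format that SERVERPROPERTY('ProductVersion') returns, and how you collect those strings from each server is left out entirely.

```python
from __future__ import annotations


def parse_build(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '15.0.4153.1' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def find_patch_drift(builds: dict[str, str], reference: str) -> dict[str, str]:
    """Return every server whose build differs from the reference server's build."""
    reference_build = parse_build(builds[reference])
    return {
        server: version
        for server, version in builds.items()
        if server != reference and parse_build(version) != reference_build
    }


if __name__ == "__main__":
    # Hypothetical inventory; in practice these values would be gathered
    # from each production and standby server before the drill.
    inventory = {
        "PROD-SQL01": "15.0.4153.1",
        "DR-SQL01": "15.0.4003.23",
        "DR-SQL02": "15.0.4153.1",
    }
    drift = find_patch_drift(inventory, reference="PROD-SQL01")
    for server, version in drift.items():
        print(f"{server} is at {version}, production is at {inventory['PROD-SQL01']}")
```

Comparing parsed tuples rather than raw strings avoids false mismatches from stray whitespace, and it makes it easy to extend the check to report which server is behind.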

Next Steps

  • How you answered the questions above dictates what your next steps should be.
    • If you have no plan in place and your organization is relying on IT, then there is no better time than the present to build a plan.
    • If you have a plan in place, but know it has changed, now is the time to update it. It is better to find out now that passwords are out of date than during a crisis.
    • If you have not tested your plan in a while, now is the time to put it through the motions to see how good or bad the plan really is. This could be a great opportunity to find room for improvement and save precious time if and when a disaster strikes.
  • Take a hard look at your disaster recovery plan and determine its state of affairs. If it is a dusty notebook that makes you uncomfortable and you know changes have been made to your environment, consider where updating and testing it should fall on your daily or weekly priority list.


About the author
Since 2002, Jeremy Kadlec has delivered value to the global SQL Server community as an Edgewood Solutions SQL Server Consultant, MSSQLTips.com co-founder and Baltimore SSUG co-leader.
