I'm responsible for ensuring that my SQL Servers can be
recovered properly in the event of a
disaster. I've heard a lot of the acronyms, but how do I apply this to practical steps with respect to my environment? What else do I need to know other than the magic acronyms?
Check out this tip to learn more.
In this tip we'll focus on the planning aspect around
backups. The two magical acronyms are RTO and RPO, which stand for Recovery Time Objective and Recovery Point Objective respectively. Let's start there.
Recovery Point Objective
In a nutshell, the RPO is the amount of data you can lose, measured by time. It is likely to be different from system to system and application to application. For instance, a critical system may have an RPO measured in minutes while a non-critical one may have it measured in days.
In order to meet RPO, you need to have the appropriate
backups. For instance, if you're talking about an RPO of 15 minutes, that means you'll need backups at least every 15 minutes for that system. How you accomplish that will again differ from system-to-system. Many times it depends on the size of the database(s) in question and the RTO.
Please note, if you have an RPO for
disaster recovery, you can't just look at your
backups that are being stored locally. You've got to think about whether or not those backups are getting offsite within the RPO. Here's an example where the RPO will be broken:
The backups are written to a server locally (or even directly to tape).
If that backups are being written to a server, that server is backed up nightly.
A third-party vendor picks up the tapes each morning and moves them to an alternate location.
Why does this fail? Quite simply, the backups are only as recent as when the tape backup was taken. So if the tape was written at 2 AM and the disaster occurs at 6 AM, the RPO is going to be busted. Furthermore, if said vendor doesn't pick up the tapes until 8 AM, the organization is in worse shape because that means the data is only current to 2 AM the previous day.
Recovery Time Objective
This is how much time you have to bring up a system from the time a disaster occurs. For instance, if RPO is set at 24 hours for a system, then that's how long you have to get the system back on-line. Keep in mind that recovering a system is usually more involved than just restoring a database or two. The question that needs to be asked is how long does it take to get the system on-line when you include everybody. This includes the folks that install and configure the application getting their part right. It also includes the times it takes for the folks in charge of backup operations to get your SQL Server backups in a usable location/system for you. As you might guess, your part of that 24 hours has begin to shrink and shrink fast.
This is where your backup scheme comes into play. Specifically, how fast you can recover based on your backup scheme. If you take a backup once a week and you do 15 minute transaction log backups for the rest of the week, not only are you potentially stuck in the event a transaction log backup is bad or corrupt (you are testing them, right?), but imagine if the disaster happens towards the end of the week. That's a lot of transaction log backups that have been restored after the full backup. That scheme may not allow your organization to meet its RTO for a given system. The only way to know for sure is to actually test the recovery. Look to test the worst case scenario multiple times to get a reasonable estimate of how long such a restore will take in the event of a true disaster.
The Run Book
One of the possibilities that is easy to forget is that key people are not available due to a disaster. Imagine that key person is you. If you're the only one that knows the steps to recover the SQL Servers to your company's specifications, your company is in trouble. Likewise, if you're the only one that knows the backup scheme, that means you're the only one that knows the recovery process. Again, your company has a big problem if you're not there.
Let's consider the best case scenario: you are there. However, if a disaster recovery test tends to be a hectic affair, imagine the real thing. Then imagine the real thing with key personnel out. In those kinds of situations, a checklist of what to do helps. Step-by-step procedures of how to bring a system back on-line is invaluable because it means steps don't get missed or skipped.
You could try to quickly sketch a series of steps after a real disaster. However, the reality is that with complex systems, steps will be missed. In addition, you may not have time to outline those steps. Therefore, the best time to develop that list of steps is when everything is fine. We typically call such steps a run book. Of course, how do you know if a run book has everything it should? How do you verify that each step is correct? That's what the test is for.
The Disaster Recovery Test Itself
When executing for a disaster recovery test, that run book should immediately come into play. The run book should be verified in the course of the test. Steps that are missing should be included. Processes and procedures that are incorrect should be edited. Anything that doesn't help recovery of the system should be removed. Of course, if you're doing the run book correctly, then you have to account for configurations and procedures that are test only and those that are a disaster situation only. The truth is there are always such steps and they should be in the run book.
A good set of advice I've heard with regards to writing and testing run books is the most senior technical personnel, the ones who are most likely to know the system including all the tiny but critical implemental details, write the run book and the most junior personnel execute it. The junior personnel will be the least likely to be able to use prior knowledge, especially in the more intricate parts of the recovery.
Whoever is executing the run book, it's extremely important to note any and all deviations from the run book. The objective is to ensure the run book is as complete and accurate as possible in the event of a real disaster. It can be tempting to gloss over "minor" issues. That will just cause complications if the run book actually has to be used. As a result, the focus should be on being as thorough as possible and in trying to find and locate every issue with every run book.
Is There More?
Yes, there's a lot more, such as getting
logins right, unlocking
encryption keys, knowing how long it'll take to check the
consistency on your databases, and the like. However, we'll save that for the next