By: Tim Smith | Comments | Related: > Disaster Recovery
Problem
One of the challenges we face is disasters catching us off guard and our teams using too much time to correct the issue. For an example, we recently had to resort to backups because an integrity check failed on one of our databases and we were unable to resolve the issue. We accepted the loss, but the time it took our team to bring an old backup with transactions logs online angered many of our clients, as they were unable to connect to the databases during the time it was offline. What can we do to prevent this from happening in the future?
Solution
In addition to simulating disaster recovery drills, I'd recommend looking at the big picture of your architecture, preparing solutions in advance while keeping backups, and considering who else needs to be involved. In the below steps, I outline the best approaches to being prepared for any disaster.
1. Architect a design that understands the problem (or opportunity).
We create solutions to problems or to enhance opportunities; if a solution creates more problems than it solves or if it creates more problems than opportunities it retrieves, it's not a solution, it's a headache.
While no design is perfect, that doesn't mean that whatever will work. Build the best design for the problem that is being solved or opportunity that is being leveraged. Always document why you went with one design over another and highlight how the design matches with the goal - for instance, how does the design solve the problems? Or, how does the design help leverage the opportunities?
In some situations, improvements will be required (or expected), and our design must reflect the current version of the release. Build architecture to retain deposit information or ETL structure for importing weather data may not change compared to the architecture for analyzing both. In the case of a changing architecture, consider the dependencies that will be affected by the change - for instance, a table that may have columns added to it later may affect stored procedures that use * instead of explicit columns. Dependencies can create outages too.
2. Identify the weaknesses of the architecture.
Every design has one or more weaknesses and the more honest we are about the weakness, the more we can prepare for when that weakness breaks.
Suppose that our database has daily backups with transactions logs running regularly to where we lose very little data in a disaster, and our backups and transaction logs are in multiple locations in case one of those fail. This is the best that we can do in this scenario due to our resource limitations.
Now, let's assume that there is a disaster - one of the weaknesses is how quickly we can get the backup and transaction logs back online. Is there a way we can automate this by grabbing the most recent files? The more we know our weakness, the more we can build an automated solution that can be quickly executed when it occurs. The same is true for any system limitation, whether it's an availability group, active-active replication, etc.
3. Study "what else could go wrong" scenarios.
We may know our weaknesses, but we don't know what we don't know, which is why we should not just evaluate what we know is our weakness, but also evaluate what else could be our weakness.
Do you want to be caught off guard when a database goes into suspect mode and clients can't access it? We have many options here such as hiring or working with a consultant to look at the design and identify weaknesses we're unaware of, identifying weaknesses through integrated learning (such as seeing the weakness of another unrelated system may make you aware of a weakness in your own system), or (and) listening to a customer screaming at us for poor design or because the architecture caused a problem for them (free feedback - always, always listen to this and respond immediately with a solution).
4. Build the solution.
Once the weaknesses are identified, create the solution and the more a script can do, the better, as this can be done at any time.
Is the solution a restore? Is it adding another server to the availability group? Whatever the solution is, don't procrastinate creating it. When people talk about "regular disasters that we don't have solutions for" what I've often discovered is that they already knew of the solution, they just waited on building it. The longer we wait, the more risk we're passing to our clients and they have every right to be angered, as this conveys disrespect. Some solutions may require alerting, as automation is either limited or impossible. In these cases, we need to ensure our alerting doesn't have a single point of failure, as an alerting server or script may fail too.
5. Practice the solution.
"Don't worry this will work when we need it" are the famous last words of untested code for disasters.
Do we want to discover a bug when we need to get the database online in less than a minute? Probably not. Simulate the solution in a test environment and always make sure to have a backup of the solution (if code, source control!). If our emergency script is saved on the same server that fails, then it doesn't matter what solution was created - and this is one of the many values of testing our code. Single point of failures can occur with servers, processes, applications, and even code that is designed to solve the former problems. When companies have asked me why they struggle to retain their talent, I often ask how often they perform these drills of testing solutions (or architecting in a way to avoid the problems) and this often explains why - it's the worker who pays with stress and a time cost when problems occur; talent will always leave in those situations.
6. Train your juniors.
The best way to teach people how something works is to simulate failure until the person is comfortable with the state.
For an example, some junior DBAs feel intimidated by replication, yet a Senior DBA turning "off" some important pieces of it (like the log reader, or the distribution server, etc.) and letting the junior DBA go figure out "what's wrong" is a great way to help someone feel comfortable with quickly resolving a failed state. The same is true for a corrupted database, especially if you can keep the state on a VM saved and train juniors to fix the issue and become comfortable with fixing the issue. The key isn't just to train the junior how to do something, but train him or her to be comfortable in the "state of panic" that people describe, as remaining calm through a resolution is a key to avoid creating a bigger disaster in some cases.
7. Coordinate with other teams.
In larger environments, we often work with other teams who have realms of authority, such as a System team, DBA team and Developer team. While startups and some agile environments may roll these together under "technologists", in the other environments, we want to coordinate and train with others who may be involved in the solutions.
As an example, we may not be able to reset a login in some cases and have to coordinate with the System team, or we may see corruption on a database after an ETL load and coordinate with the Developer (sometimes ETL) team to resolve the issue without affecting each team as much. The larger the environment, the more we'll need a matching development (or close to it) environment, as simulation and coordination among teams without a similar environment may result in a waste of time. If the problem cannot be simulated in the development environment due to resources, we'll have to expand it to simulate the problem. In general, I would highly avoid waiting for a disaster and hoping to have coordination among teams - in these environments, disaster recovery drills must involve multiple teams.
8. Reward excellence.
All living things, from farm animals to humans, respond and behave according to incentives. Once we complete the architecture, the solutions for disasters, and continue testing our solutions with drills, we have to respect what we've done. Always take time to appreciate good work, as it's an incentive to continue completing good work.
Next Steps
- A simple way to increase the urgency of testing and preparing for a disaster is to be as honest as possible about the costs of a system or application being offline.
- What issues have you faced before now that do not have solutions in place to prevent or solve them?
- What problems may you face and how can you identify these if you don't know?
- Who will you coordinate with when a disaster occurs?
- Read these other Disaster Recovery Related Tips
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips