SQL Server Database Integrity Check Best Practices
You work in an environment with many SQL Servers that have several databases on each server, and one day you see that an integrity check failed on one of the databases. Let's assume that nothing in the environment has been outlined, so you have no idea who the end user of this particular database is, or what priority level the data are. What are some things to keep in mind if this occurs, and ways to mitigate the confusion that may come when you have to make quick decisions?
Unfortunately, environments with many SQL Server instances running can overlook regular integrity checks, and running DBCC CHECKDB is not an option if you want to prevent permanent data loss. If you aren't currently running them, read this tip about the value of running them, as that is the first problem that will need to be solved. Let's look at a couple of data priority scenarios depending on importance that will be useful for the rest of this tip.
The Environment List
High Priority Databases
If databases hold data for an important business process that generates heavy revenue or could result in high costs due to loss, increase the frequency that you run an integrity check, even if that frequency is daily. In these environments, you cannot have a single point of failure, and when we're talking high revenue or cost, one of the best approaches is an Active-Active environment, where one can be "taken down" for maintenance while the other is running and vice versa. The key in this environment is that any data loss is unacceptable, as any loss of data could be costly. Imagine a scenario of a company that stores bitcoin trades and it loses a trade of 10,000 bitcoins due to data corruption from a disk failure (one row of data loss - one row); do you think that the person who lost 10,000 bitcoins will be okay with, "Unfortunately, we lost some data after our integrity check failed" when this could have been prevented? No, and the company in that case should expect a lawsuit. Simply put: the business will understand the importance, if this scenario applies. In this scenario, the company cannot even lose one row of data without it potentially costing the company a massive amount of money.
Medium Priority Databases
If the databases hold data that are high priority, but a very small amount of data loss is acceptable - such as monthly payment information that is always the same amount based on a schedule - the business may accept losing a data value or two because it can be deduced, or because at worst, it means they will credit a customer with a payment, even if the customer did not make it. In these environments, to minimize data loss, run an integrity check before backing up the database, keep the databases in full recovery mode with frequent transaction log backups, and consider a set up like availability groups with AlwaysOn. In this scenario, we want to still lose as little data as possible in the worst case scenario, but we know that if we lose a small amount of data, the cost will be in a range that our business can accept.
Low Priority Databases
These databases hold low priority and data loss is acceptable. This may be data from an ETL process that can be re-loaded with a minimal time window, data that are stored on paper that have been saved to a database and can be re-loaded. A great example of this is data for marketing to new clients where a database stores potential clients: in some companies, these data could be re-loaded through the same process that identified these clients; for other companies, this may be a high or medium priority. The recovery plan for these is often a re-load, or in some cases, accepting data loss. In these environments, more than likely restoring an old backup, or using repair and accepting data loss will be acceptable. Make sure that you confirm this with the end user (or users) and never assume this without evidence.
The "Scary" Others
One disaster-waiting-to-happen when you are first introduced to an environment is a business team treating a database as less important than it is, as we always want to err on the side of caution; being overly cautious is less of a problem in most cases. In an ideal environment, you will know how important each database is for each server before ever experiencing an integrity check failure; however, this will seldom occur and more often than not, it will be an after-the-fact discovery. As an example, you may view a database as on the low priority list based on how its designed or discussed, only to find out later that it's a high priority database where no data loss is acceptable. Likewise, you may propose a solution to immediately move the database to an architecture where there is no single point of failure due to its priority, yet the business procrastinates on this. In some situations, you may want to quickly exit the company or contract with the client as the disaster will be on you even if you warned about the disaster that could happen.
High Priority Databases
In the case of high priority databases in large environments (or environments in which the revenue model is in the billions), I refuse to accept a design that is not Active-Active and clients or companies can find another DBA and (or) Architect if they won't support this model. A company that makes a billion dollars from a database can pay the costs to run at least two active instances where reports or processes can be pointed to the currently active one. This also makes required maintenance - like checking integrity - easy since I can run maintenance on one while the other is the database the application uses, then flip the switch when I'm ready to run maintenance on the other.
I've seen a few instances and heard of many more where a manual failover was required in the AlwaysOn model and the application experienced data loss from failure entering the application; I like AlwaysOn for medium priority databases or a behind-the-scenes support of one Active node in an Active-Active model, but when we're talking at least 9 figures in revenue, it's too great of risk to run it. In the case of an Active-Active model, I've never run into a situation where both nodes had corrupt databases, though relative to the hardware there is a probability of this happening. When we're talking about high priority databases, we are dealing with a situation where even one record being lost is too much and the only way to eliminate, or significantly reduce, that risk is architecture designed for it.
In general, architecture, appropriate maintenance tasks and regular disaster recovery drills centered around database corruption to support recovery are required for any low, medium or high priority environment. When designing a model in the beginning large environments, always document your suggestions with the pros and cons of any approach (including your own), along with your concerns. Because none of us have an infinite amount of time, I would highly suggest delineating a limit on support if a team chooses to go a route without consideration of the support required, otherwise you may find yourself supporting a model you recommended against.
Medium Priority Databases
In the case of medium priority databases, I will try to some techniques to first see if I can pass CHECKDB without any data loss. These include, but are not limited to:
- Restoring a database with most recent logs and verifying CHECKDB on another server. The key here is to pin-point the failure window.
- Drop all the non-clustered indexes and re-create them (relative to the CHECKDB error). I've been blessed in my career that this has fixed most of the issues, relative to the error output from CHECKDB.
- Relative to the error and the output of checking integrity on the table level, dropping a table and re-creating it from the backup and restore, if the table is a look-up table where values haven't been added (or won't be).
Low Priority Databases
In the case of low priority databases, I will use some of the methods above in the medium priority category - like re-creating non-clustered indexes - but I know that data loss is allowed and there's a trade off in time if I spend days correcting a database in an environment where we have thousands of databases, and data loss is allowed for the low priority database.
A few other very important points about experiencing this issue:
- Don't keep your backups in only one location. Expect to lose at least one location, or run into corrupted backups from time to time.
- Create a regular restore schedule for your backups - this is much easier in an Active-Active environment, but applies to medium priority environments too. Don't assume that a backup will work, even if it passes RESTOREVERIFYONLY post backup. Only a restore and running CHECKDB verifies database validity.
- Over communicate with the end user and (or) client when this occurs. You want to be clear before this issue, during this issue, and while you're correcting this issue. Communication builds trust.
Other Things to Consider
Finally, I've observed a strong correlation between log micro-management and CHECKDB failures in my experience. I dislike log micro-management, but many environments engage in this and it's an unfortunate reality; the best practice is pre-setting log growth, adjusting the database to simple or full recovery, relative to needs, and teaching developers to batch their processes instead of loading entire data sets. My preference for logs has always been what I call the High-Low-Middle model after the famous pool game: generally, logs should be pre-sized according to configuration set in this model. As data volume grow, this is something developers should know in general. In the case of simple recovery models, I'll force developers to work around the error of ACTIVE_TRANSACTION, while in full recovery models, I'll make sure the log is backing up frequently (still, transactions should be run in smaller batches). There's absolutely no reason to try and add a petabyte of data in one transaction. If you're in an environment with log micro-management, in my experience, CHECKDB failures are only a matter of time.
- Outline your database list from the most important databases where absolutely no loss is acceptable to databases where some data loss may be acceptable.
- Immediately build architecture that supports databases where data loss is not acceptable; the longer you wait on this, the more you increase the odds that you experience massive data loss.
- Check the maintenance tasks you have running and verify that they support the priority of each database.
- Run disaster recovery drills regularly, testing how quickly you can complete a standard set of tasks that you may need to do when experiencing an integrity failure. An example of this would be how long does it take for you to restore a backup to a test environment, grab a look-up table, and migrate it to production after dropping it in production?
- Read related tips here SQL Server Database Consistency Checks Tips
About the author
View all my tips