Lessons learned from SQL Server process that took over 3000 hours to complete
Recently I encountered a situation where I had to re-run 3 years of ETL processes and the related computations to re-create a historical database using revised code and new business rules encapsulated in the metadata. The results were going to be different enough from what had been accumulated over the three years that the client really wanted it re-run. All the code is T-SQL with some SQL CLR routines included. We had improved the speed of our programs during the development period, but I calculated that it was going to take 13 weeks to re-run the whole thing. That is, if everything went well.
Plan for Interruptions
Start by planning for failure. The program is going to run for a long time so chances are that something is going to interrupt it. Hardware failures, storage failures and network failures are always a possibility, but have become rare events, at least for most of us. The most likely cause of interruptions and the one that got me a couple times is the IT department. They decided that they had to reboot the server without notice. After the first time it happened, I put in a specific request to not have the server rebooted and that request was granted. Somehow that didn't make any difference at all. A couple of weeks later the machine was rebooted again. I was informed only that there was a "Company Policy" that required it. Remember the Boy Scout motto: Be Prepared.
Preparation for failure for database programs usually takes the form of backups. I integrated a set of SQL Server database backups into the process. I used a full backup every two days and differential backups every 4 hours. That left me vulnerable to loosing 4 hours plus time to restore everything. Even if that was 6 hours, it wasn't going to set the project back significantly.
While we're discussing backups, a step to consider is to convert to SIMPLE recovery model. You might need FULL recovery model for real-time production, but if you're running a program offline based on a recorded set of inputs, why would you need the full transaction log? You might use it to reduce recovery time, but the extra log backups will slow you down and can be difficult to manage. My database was in SIMPLE recovery model to start so I stuck with that.
Run in Isolation
On my Kindergarten report card Mrs. Spidell give be a good grade for "Shares with others". I wouldn't get such a good grade anymore. Long running programs are best run in isolation with as little sharing as possible. Even though it might be tempting to use some of the available resources, try and run any long running program in isolation. Push the other kids out of the sandbox for the duration of the run.
Don't Tweak the Process
When you look at an estimate of 13 weeks to complete a process there's always a temptation to try and tweak the process to make it just a little bit faster. Resist! Go with the code that you know works. The chance of introducing errors or other types of program failures is more important than shaving a week or two off the run time.
Build Real Estimates
Estimates shouldn't come out of thin air. Try and build a model of how long the process will take based on the best information you have. It doesn't have to be perfect, but every week you're going to be asked how long it's going to take. It's good to have an answer that isn't just a guess. I based mine on the number of input files processed.
You're going to verify the results somehow, right? Of course you are. If at all possible, build the verification into the process and have it run periodically. Don't wait until the end to run the verification because you don't want to go three months and find out that something went wrong in week one. Another benefit of integrating the tests into the process is that once you're done, you might really be done and not at the start of another lengthy process.
Use All Available Time
The clock has 168 hours in a week, but there are only 40 in the standard work week. Be ready to be available evenings and weekends. Things are going to happen when you're not working and waiting until standard work hours to fix issues or help the process along will add to the duration of the project. That might be OK. If it is, maybe you don't need to check in off hours. I'm pretty sure the process would have taken 1/3 more time, if I hadn't been available to do my part when it was needed. Usually my only task was deleting old backups, but it's important that what ever your part, it gets done in a timely manner when the system is ready.
Have Some Fun
Even though it's a long slog, try and have some fun with the project. Near the start I set up a contest for guessing when the program would be complete. The prize was a 20 ounce can of peanuts like the ones that I had given out during the Holidays last year. This gave everyone on the team a rooting interest in the outcome. When the program hit the 50% mark I got the group to sing Bon Jovi's "Livin' On A Prayer" which features the lyrics "Woah, we're half way there". You've heard it. Group sings are fun as long you keep them short and keep the music loud enough to not hear much of the actual singing.
My project completed in about 11 weeks, which was a little faster than the estimate of 13 weeks. It's usually better to be faster rather than slower, but with a process this long, estimates are difficult. There's no reason to be overly optimistic in your projections so err on the high side.
- Build your model of how long your process takes before starting such a process
- Line up your resources and isolate them from other processes
- Plan your backup strategy
- Figure out how you're going to verify as you go
About the author
View all my tips