Lessons learned from a SQL Server process that took over 3000 hours to complete



Problem

Recently I encountered a situation where I had to re-run 3 years of ETL processes and the related computations to re-create a historical database using revised code and new business rules encapsulated in the metadata. The results were going to be different enough from what had been accumulated over the three years that the client really wanted it re-run. All the code is T-SQL with some SQL CLR routines included. We had improved the speed of our programs during the development period, but I calculated that it was going to take 13 weeks to re-run the whole thing. That is, if everything went well.

Solution

Plan for Interruptions

Start by planning for failure. The program is going to run for a long time, so chances are that something is going to interrupt it. Hardware failures, storage failures and network failures are always a possibility, but they have become rare events, at least for most of us. The most likely cause of interruptions, and the one that got me a couple of times, is the IT department. They decided that they had to reboot the server without notice. After the first time it happened, I put in a specific request not to have the server rebooted and that request was granted. Somehow that didn't make any difference at all. A couple of weeks later the machine was rebooted again. I was informed only that there was a "Company Policy" that required it. Remember the Boy Scout motto: Be Prepared.

Have Backups

For database programs, preparation for failure usually takes the form of backups. I integrated a set of SQL Server database backups into the process: a full backup every two days and differential backups every 4 hours. That left me vulnerable to losing at most 4 hours of work plus the time needed to restore everything. Even if that came to 6 hours, it wasn't going to set the project back significantly.
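The exact statements will vary, but that cadence boils down to two scheduled backup commands. This is only a minimal sketch with a hypothetical database name and folder; in practice each file name would carry a date/time stamp and the statements would run from SQL Server Agent jobs:

-- Full backup: scheduled every two days
BACKUP DATABASE HistoricalDB
    TO DISK = N'E:\Backups\HistoricalDB_FULL_20160601.bak'
    WITH COMPRESSION, CHECKSUM;

-- Differential backup: scheduled every 4 hours
BACKUP DATABASE HistoricalDB
    TO DISK = N'E:\Backups\HistoricalDB_DIFF_20160601_0400.bak'
    WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;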

Recovery Model

While we're discussing backups, a step to consider is to convert to SIMPLE recovery model. You might need FULL recovery model for real-time production, but if you're running a program offline based on a recorded set of inputs, why would you need the full transaction log? You might use it to reduce recovery time, but the extra log backups will slow you down and can be difficult to manage. My database was in SIMPLE recovery model to start so I stuck with that.
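Checking and, if appropriate, changing the recovery model is a one-time step before the run starts. A minimal sketch, assuming a hypothetical database name:

-- Confirm the current recovery model
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'HistoricalDB';

-- Switch to SIMPLE if FULL isn't needed for the offline run
ALTER DATABASE HistoricalDB SET RECOVERY SIMPLE;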

Run in Isolation

On my Kindergarten report card Mrs. Spidell gave me a good grade for "Shares with others". I wouldn't get such a good grade anymore. Long running programs are best run in isolation with as little sharing as possible. Even though it might be tempting to let other work use some of the available resources, try and run any long running program in isolation. Push the other kids out of the sandbox for the duration of the run.

Don't Tweak the Process

When you look at an estimate of 13 weeks to complete a process, there's always a temptation to tweak the process to make it just a little bit faster. Resist! Go with the code that you know works. The risk of introducing errors or other program failures outweighs the benefit of shaving a week or two off the run time.

Build Real Estimates

Estimates shouldn't come out of thin air. Try and build a model of how long the process will take based on the best information you have. It doesn't have to be perfect, but every week you're going to be asked how long it's going to take. It's good to have an answer that isn't just a guess. I based mine on the number of input files processed.
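If the process logs each input file as it finishes, a simple query over that log can turn the file count into a projected completion date. The table and column names below are hypothetical, but the arithmetic is the whole model: the rate observed so far, applied to the files remaining.

-- Hypothetical progress log: one row per input file, stamped when done
DECLARE @TotalFiles int = 25000;   -- known count of input files to process

SELECT
    COUNT(*)                                        AS FilesDone,
    @TotalFiles - COUNT(*)                          AS FilesRemaining,
    DATEDIFF(hour, MIN(CompletedAt), GETDATE())     AS HoursElapsed,
    DATEADD(hour,
        DATEDIFF(hour, MIN(CompletedAt), GETDATE())
            * (@TotalFiles - COUNT(*)) / NULLIF(COUNT(*), 0),
        GETDATE())                                  AS EstimatedFinish
FROM dbo.ProcessLog
WHERE CompletedAt IS NOT NULL;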

Verify Results

You're going to verify the results somehow, right? Of course you are. If at all possible, build the verification into the process and have it run periodically. Don't wait until the end to run the verification because you don't want to go three months and find out that something went wrong in week one. Another benefit of integrating the tests into the process is that once you're done, you might really be done and not at the start of another lengthy process.
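The checks themselves don't have to be elaborate. Something as simple as reconciling loaded row counts against the counts recorded when each input file was staged will catch most loading problems early. The table and columns here are hypothetical:

-- Any rows returned mean a file needs attention now, not in week twelve
SELECT FileName, StagedRows, LoadedRows
FROM dbo.ProcessLog
WHERE CompletedAt IS NOT NULL
  AND LoadedRows <> StagedRows;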

Use All Available Time

The clock has 168 hours in a week, but there are only 40 in the standard work week. Be ready to be available evenings and weekends. Things are going to happen when you're not working, and waiting until standard work hours to fix issues or help the process along will add to the duration of the project. That might be OK. If it is, maybe you don't need to check in off hours. I'm pretty sure the process would have taken a third more time if I hadn't been available to do my part when it was needed. Usually my only task was deleting old backups, but whatever your part is, it's important that it gets done in a timely manner when the system is ready.
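If your part is something like clearing out old backup files, it helps to have the command ready to run before you need it. This is only a sketch, with a made-up folder and retention window; xp_delete_file is the undocumented procedure that maintenance plan cleanup tasks call:

-- Remove .bak files older than 3 days from a hypothetical backup folder
DECLARE @cutoff datetime = DATEADD(day, -3, GETDATE());

EXECUTE master.dbo.xp_delete_file
    0,                -- 0 = backup files
    N'E:\Backups\',   -- folder to clean (hypothetical)
    N'bak',           -- extension, no leading dot
    @cutoff,          -- delete files older than this
    0;                -- 0 = don't recurse into subfolders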

Have Some Fun

Even though it's a long slog, try and have some fun with the project. Near the start I set up a contest for guessing when the program would be complete. The prize was a 20 ounce can of peanuts like the ones that I had given out during the Holidays last year. This gave everyone on the team a rooting interest in the outcome. When the program hit the 50% mark I got the group to sing Bon Jovi's "Livin' On A Prayer", which features the lyrics "Woah, we're half way there". You've heard it. Group sings are fun as long as you keep them short and keep the music loud enough to not hear much of the actual singing.

Success!

My project completed in about 11 weeks, which was a little faster than the estimate of 13 weeks. It's usually better to be faster rather than slower, but with a process this long, estimates are difficult. There's no reason to be overly optimistic in your projections, so err on the high side.

Next Steps
  • Build a model of how long your process will take before you start it
  • Line up your resources and isolate them from other processes
  • Plan your backup strategy
  • Figure out how you're going to verify as you go





About the author
Andy Novick is a SQL Server Developer in the Boston area with 25 years of database and application development experience.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Wednesday, January 10, 2024 - 5:20:04 AM - Thomas Franz
Split the workload into smaller batches (e.g. per day or per 2 million rows; cursors are fine in this case) - never try to run a single statement that takes more than maybe one hour.

Otherwise:
* your transaction log may explode / fill the whole disk, since even in SIMPLE recovery mode it will not be truncated / reused until the longest-running transaction is committed
* when something goes wrong and the whole thing has to roll back, you waste a lot of time. Rollbacks usually take about double the time of the original statement, so if you have to roll back a 6-hour transaction, you may be waiting another 12 hours and have wasted 18 hours of work time
* the same applies when the server / network / whatever crashes and the transaction is rolled back automatically

Create indexes that fit your workload batches. It makes no sense to scan a whole 20 TB table 1000 times because you have to do a complete table / clustered index scan for every batch.

And write logs not only to the console (the PC / VM / SSMS where the statement runs may crash) but to a table in another small database (!) that you can and do back up / restore separately (ideally in FULL recovery mode), so you can check what went wrong and when, and what you may have to do or prepare, while you are waiting for the main database to restore or the statements to roll back.

Try not to use explicit BEGIN TRANSACTION / COMMIT except around smaller batches. It slows things down, creates locks, bloats your transaction log, etc. (I have already seen an explicit transaction outside (!) of the cursor that breaks the work up into smaller, more manageable batches.)

Create some sort of external control of the process. E.g. you can create a control table where you can insert a simple varchar that says "STOP". In the main process you regularly check this table (at least at the beginning of the batch loop) and BREAK the loop when you find the STOP value. Any time a really important security fix has to be released or some other maintenance has to happen, you want a better option than pressing Alt-Break: a way to shut down your process in a controlled fashion - maybe with some extra wait time (until the current step is done), but without wasting real time by canceling it at 95% of the current batch and waiting until everything is rolled back, only to redo it later.

Thursday, June 30, 2016 - 1:53:07 AM - Constantine

Wow, 11 weeks, beyond my imagination! How many terabytes of data were there to be loaded?


Monday, June 27, 2016 - 12:24:47 PM - Aleksei

 

Great post, thanks.

I personally would add "Output progress at least to the console" (or maybe even RAISERROR with the lowest severity) to this. The estimate still has to meet the real world :)














