SQL Server Archive and Data Retention Considerations



Problem

We're planning to archive some old data in some of our SQL Server databases and wanted to know what we should consider before building a design to accomplish this.

Solution

Preliminary SQL Server Data Archive Questions

Some questions to ask:

  1. What types of data are we storing?
  2. What is the most effective storage approach for our users - how often do they access or need these data?
  3. What is the most effective storage approach for our environment - what are our environment's bottlenecks, limitations, weaknesses, and structural strengths?
  4. What is the most effective storage approach given legal restrictions and requirements? Some environments - regardless of their limitations - must store data for a length of time, or in a certain manner, to comply with current laws.
  5. What is the most effective storage approach when we consider costs? For example, does our archive server need an Enterprise license? Or should only the archive server carry an Enterprise license while the other servers don't?

We may also be in a situation where we don't want to store data at all, because our product (or service) offers that as a feature - think of a messaging tool that doesn't store the messages an individual sends, so everything is temporary. Finally, depending on frequency of access, we may not want to store data derived from other data - consider a Simple Moving Average, which is derived from the price. If the derived data are recent and frequently accessed, we may store them, whereas if they are seldom accessed, we may work with application developers to use an algorithm that calculates the derived values on each call through the application. I'm always fascinated by clients who store ratios derived from data already within the structure; it's common, yet generally unnecessary.

Archiving and Retaining Data

SQL Server Archived Data Usage Frequency

We might be tempted to avoid storing data which are seldom (or never) used. A good example of this is the TED spread before 2007-2008: very few people knew what it measured or what its value was, but after the housing bubble, you could find the TED spread in many places throughout the media. We can store values in databases that expect to receive fewer queries, though I'd recommend a plan for migrating to a heavily accessed database, server, or application if demand changes. If a news story covering the dangers of excessive magnesium went viral tomorrow, we might see a sudden spike in people wanting to see data on the milligrams of magnesium in their foods, and migrating these data easily would let users obtain what they need without looking for a competitor. In a similar example, derived data do not need to be stored as values if the cost of calculating them is low - such as the ratio of potassium to sodium in foods; if both sodium and potassium are stored, we can calculate the ratio through a function or process.
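As a sketch of that last point (the table and column names here are hypothetical), the ratio can be exposed as a computed column so it is derived on each read rather than stored:

```sql
-- Hypothetical nutrition table: the ratio is computed on read, never stored
CREATE TABLE dbo.FoodNutrition (
    FoodID INT PRIMARY KEY,
    FoodName VARCHAR(100) NOT NULL,
    PotassiumMg DECIMAL(9,2) NOT NULL,
    SodiumMg DECIMAL(9,2) NOT NULL,
    -- NULLIF avoids a divide-by-zero error when sodium is 0
    PotassiumSodiumRatio AS (PotassiumMg / NULLIF(SodiumMg, 0))
);
```

The same idea applies to a moving average: if access is infrequent, a window function can derive it at query time instead of persisting it - for example, `AVG(ClosePrice) OVER (ORDER BY TradeDate ROWS BETWEEN 19 PRECEDING AND CURRENT ROW)` against a hypothetical prices table.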

Consider that time plays a major part in this as well. Using taxes as an example, very few people need access to their tax records from 15 years ago, while the last five years will see more frequent requests. Depending on our transactional limits, we may be able to store twenty or thirty years' worth of tax information in one table (or database, or server, etc.), but separate the last five years of taxes into different tables (or databases, or servers, etc.) because of access frequency. We can also change the structure for data that are seldom accessed and consider options such as:

  1. Enabling compression.
  2. Adjusting the fill factor (if the seldom-accessed data cannot change or receive new adjustments).
  3. Removing some indexes and/or tuning queries to use minimal indexing.
  4. Reallocating memory and disk for these data, for larger environments willing to spend.
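The first three options above can be sketched in T-SQL; the table name, index names, and fill factor value here are hypothetical and should be tested against your own workload before applying:

```sql
-- 1. Enable page compression on a seldom-accessed archive table
ALTER TABLE dbo.TaxRecordsArchive
REBUILD WITH (DATA_COMPRESSION = PAGE);

-- 2. Raise the fill factor to 100 since archived rows won't change
ALTER INDEX IX_TaxRecordsArchive_TaxYear
ON dbo.TaxRecordsArchive
REBUILD WITH (FILLFACTOR = 100);

-- 3. Drop an index that archive queries no longer use
DROP INDEX IX_TaxRecordsArchive_FilerName
ON dbo.TaxRecordsArchive;
```

A fill factor of 100 packs pages completely, which suits read-only data; on data that still changes, it would cause page splits.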

Automate SQL Server Data Archiving Process

Unless we store only a single data structure, more than likely each data structure will have its own archiving process, if applicable. For instance, if we were storing nutrition data for food, we might archive old nutritional reports, such as the old nutrition data for an ounce of almonds (long counted as 160 calories; researchers later found it's closer to 129 calories). But if we're storing both medical and nutritional data, more than likely we'll want to archive the medical data differently than the nutritional data. This is true for many of us who archive the error logs from SQL Server; in some environments, I keep the latest month of error logs and monthly export the past data to an archive database, which I back up, validate, and keep in case I need access. The below shows a simple example of this process:

USE Archive
GO

-- Build a dated backup file name, e.g. archive_20220119.BAK
DECLARE @dt VARCHAR(9) = '_' + CONVERT(VARCHAR(8), GETDATE(), 112)
DECLARE @path VARCHAR(100) = 'N:\baks\errorlogs\archive' + @dt + '.BAK'

-- Copy the live error logs into the Archive database
SELECT *
INTO Archive.dbo.tb_ErrorLogs
FROM Logging.dbo.tb_ErrorLogs

-- Back up the Archive database, which now holds the copied logs
BACKUP DATABASE Archive
TO DISK = @path

-- Remove the archived copy and clear the live error log table
DROP TABLE Archive.dbo.tb_ErrorLogs
TRUNCATE TABLE Logging.dbo.tb_ErrorLogs

I create the error log table in the database Archive, back up the database, then drop the table in Archive and truncate the error log table in the Logging database. The timeframe can differ; I could set this up so that I keep backups a year at a time, a quarter at a time, etc., relative to the needs of the environment and the frequency of access.
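To vary the timeframe, only the file-naming logic needs to change. As a sketch (the quarterly naming convention here is an assumption, not part of the original script), the backup name could be built from the year and quarter instead of the full date:

```sql
-- Quarterly naming: one backup file per quarter, e.g. archive_2022Q1.BAK
DECLARE @dt VARCHAR(8) = '_' + CAST(YEAR(GETDATE()) AS VARCHAR(4))
    + 'Q' + CAST(DATEPART(QUARTER, GETDATE()) AS VARCHAR(1))
DECLARE @path VARCHAR(100) = 'N:\baks\errorlogs\archive' + @dt + '.BAK'

BACKUP DATABASE Archive
TO DISK = @path
```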

Different data sets may require different approaches to archiving than the above; for instance, we may want to keep the data live in a database that can be accessed but is not built for high performance. We may also partition the data across multiple databases and servers depending on the users' needs - for instance, partitioning video game user data specifically by the version of the game when we find that users want to see the data from the latest version more than older versions (with the oldest version almost never being accessed). These costs can be high, depending on our bottlenecks, such as querying across servers, which is why identifying limitations is an important step.

SQL Server Data Archive Time Frame

A few years ago, I cautioned executives marketing to Millennials about storing too much data on Millennials, as marketing, like investing, can create problems when people have too much information. In investing, the last bubble will generally not be the next bubble, and in generational marketing, one generation tends to differ from the previous generations in ways that we wouldn't expect. Large patterns summarizing data points can become more important than the data points themselves, and too much data can create distractions, headaches, and maintenance that's not needed. This is one reason that stock data are generally stored by day and not by second when looking from a significant point in the past - what the Dow's value was in the fourth second of the sixth business day of 1929 tends to be less important than the overall pattern. The key question for these environments involves historic data and results: are the historic data providing results to you or to those using the data?

Next Steps
  • Evaluate your server environment and its limitations - from hardware to budget to user needs - against the volume of data that need to be archived.
  • Consider what could change, especially legal requirements. For instance, if you're suddenly required to store 20 years of user data instead of 5, this change may come with significant costs if the data weren't retained.
  • Schedule data archiving like maintenance, such as re-indexing, on a daily/monthly/annual basis. SQL Server Agent, SSIS, and PowerShell offer tools to help you schedule these tasks.
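As a sketch of the SQL Server Agent option (the job name, schedule, and wrapped procedure are hypothetical - the assumption is that the archive logic above has been placed in a stored procedure), the msdb stored procedures can register a monthly archive job:

```sql
USE msdb
GO

-- Hypothetical monthly archive job; point the command at your own archive script
EXEC dbo.sp_add_job @job_name = N'MonthlyErrorLogArchive';

EXEC dbo.sp_add_jobstep
    @job_name = N'MonthlyErrorLogArchive',
    @step_name = N'Archive error logs',
    @subsystem = N'TSQL',
    @command = N'EXEC Archive.dbo.usp_ArchiveErrorLogs;';

-- freq_type 16 = monthly; run on day 1 of each month at 2:00 AM
EXEC dbo.sp_add_jobschedule
    @job_name = N'MonthlyErrorLogArchive',
    @name = N'FirstOfMonth',
    @freq_type = 16,
    @freq_interval = 1,
    @active_start_time = 020000;

EXEC dbo.sp_add_jobserver @job_name = N'MonthlyErrorLogArchive';
```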



About the author
Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Wednesday, February 2, 2022 - 7:23:19 PM - Russ Hascall
Thanks Tim and Greg!

Would it work to generate a sequential record ID for each table using either ROW_NUMBER() or an identity column? We could then archive records with a sequence number less than a defined threshold (e.g., < 200,000). The threshold would vary between tables, but it seems like this would at least be consistent for us, as many tables do not have date columns.

Thanks!

Russ

Thursday, January 20, 2022 - 11:14:52 AM - Tim
@Russ

The date column (in this example) serves as the key for archiving - we want to archive older data in this case. However, there may be data that we archive because the data are seldom queried, so those data wouldn't be archived by date. For your purposes, you may want to use date, if that will be the determinant of what data become archived.

If we don't use a date column, we want to consider what key we want to use - it could be any identifier; it just has to match our criteria for what becomes archived. Also, we have to make sure that we can tie it out with whatever data we keep live (not archived). As a case in point, if I archive the 80% of car insurance claims data that are seldom queried (but not archiving them by date), I would more than likely archive them by a unique column like a claim number. The key is that I'd keep the 20% of claims data live, with the other 80% archived (but still accessible), along with an identifier that lets me tie out the full data set.
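A minimal sketch of archiving by an identifier threshold rather than a date, along the lines Russ suggests (the databases, table, and column names here are hypothetical):

```sql
-- Move rows below a defined identifier threshold into the archive,
-- keyed by a unique column rather than a date
DECLARE @threshold INT = 200000;

BEGIN TRANSACTION;

INSERT INTO Archive.dbo.tb_Claims
SELECT *
FROM Live.dbo.tb_Claims
WHERE ClaimNumber < @threshold;

DELETE FROM Live.dbo.tb_Claims
WHERE ClaimNumber < @threshold;

COMMIT TRANSACTION;
```

Wrapping the insert and delete in one transaction keeps the two data sets tied out: no row is removed from the live table unless its copy landed in the archive.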

Wednesday, January 19, 2022 - 4:04:33 PM - Greg Robidoux
Hi Russ,

If you could figure out driver tables to determine the archive you could look at other tables that are attached to those tables to determine what else can be archived. So let's say you have an order table and an order details table. If just the order table has a date you could then use this to pick which rows to archive from the order details table.

Another thing you could try is if there is some logic to how your keys are generated to see if you could use the keys in the tables to determine what to be archived if a date does not exist in the table.

Ideally having dates would be the easiest to determine what data is older.

-Greg

Wednesday, January 19, 2022 - 2:19:55 PM - Russ Hascall
Hi Tim, I know it has been some time since you posted this tip but I had a few questions for you regarding archiving approach.
Is it possible to have an overall database archive process that is not dependent on a date key in a table? We have close to 100 tables in our database and would like to set up an archive process, but they do not all have a common date field to reference. Would we need to go through each table and identify or set up a date reference for archiving, or is there another approach we might take? Thanks!













