Identify SQL Server Database Growth Rates

By: Marios Philippopoulos   |   Updated: 2015-08-14   |   Comments (2)   |   Related: More > Database Administration


Problem

Unchecked SQL Server database growth can hurt the bottom line of organizations in the form of poor application performance and increased infrastructure costs. Yet data-growth monitoring is often an afterthought in the day-to-day operations for SQL Server DBAs. To make matters worse, inclusion of data-archiving/pruning mechanisms is frequently not a requirement in the design and implementation phases of new application development. Consequently, live application databases often grow unnoticed to the point when unplanned measures have to be taken to prevent disruptions of service. Is there an easy way to proactively obtain a recent list of the fastest-growing databases in a SQL Server instance using information captured by default in the database engine?

Solution

T-SQL window functions were first introduced in SQL Server 2005, and have been enhanced in newer versions of the product. As their name suggests, these functions allow analysis over "windows" or ranges of data, something that is otherwise possible only through complex, and often poorly performing, subqueries or common-table expressions (CTEs). In addition to improved query performance and simplified code, window functions make it easy to extract data aggregations along with individual data points in the same result set, a point discussed by Fabiano Amorim here.
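
To illustrate the point, here is a minimal sketch (against a hypothetical [dbo].[SalesOrders] table, not part of this tip's scripts) that returns each order row alongside its customer's aggregate total in a single pass, with no subquery or CTE:

--a minimal sketch: per-customer total alongside each order row ([dbo].[SalesOrders] is a hypothetical table)
SELECT
	o.[CustomerID]
,	o.[OrderDate]
,	o.[OrderTotal]
,	SUM( o.[OrderTotal] ) OVER ( PARTITION BY o.[CustomerID] ) AS [Customer Total]
FROM
	[dbo].[SalesOrders] o;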

In this tip I show how to harness the power of T-SQL window functions to calculate the average growth rates of full database-backup sizes from system table msdb..backupset. Erin Stellato has previously demonstrated how to use backupset to trend the growth of individual databases over time. I am building on this approach by calculating differences between adjacent time points (backup size difference between today and yesterday, for example) and then averaging these differences for each database to come up with a list of the fastest-growing databases in a SQL Server instance. For this I am using the LAG window function, added in SQL Server 2012, so the solution presented here applies to SQL Server 2012 and higher.

Implementation

First I need to use the LAG function to calculate differences in backup sizes between adjacent time points for each database in the SQL Server instance. LAG provides access to a row that precedes the current row in the same result set by a specified offset (one row back by default). The code snippet looks like this:

-----------
--Script 1:
-----------
...
,	LAG( s.[backup_size] ) 
		OVER ( PARTITION BY s.[database_name] ORDER BY s.[backup_start_date] ) AS [Previous Backup Size]
FROM
	[msdb]..[backupset] s
...

In the above script I apply the LAG function to the [backup_size] column of the [backupset] table with the default offset of 1 to get the immediately preceding row in the result set by [backup_start_date]. The "OVER ( PARTITION BY ... ORDER BY ... )" construct ensures meaningful comparisons by restricting each partition to a single database and ordering its records chronologically by backup start date. I can similarly partition by database name to obtain the COUNT of records (sample size) per database in the data set. Here is the full query returning the current and previous backup sizes (in megabytes), as well as the sample size, for all databases:

-----------
--Script 2:
-----------
SELECT
	s.[database_name]
,	s.[backup_start_date]
,	COUNT(*) OVER ( PARTITION BY s.[database_name] ) AS [Sample Size]
,	CAST( ( s.[backup_size] / 1024 / 1024 ) AS INT ) AS [Backup Size (MB)]
,	CAST( ( LAG( s.[backup_size] ) 
		OVER ( PARTITION BY s.[database_name] ORDER BY s.[backup_start_date] ) / 1024 / 1024 ) AS INT ) AS [Previous Backup Size (MB)]
FROM
	[msdb]..[backupset] s
WHERE
	s.[type] = 'D' --full backup
ORDER BY
	s.[database_name]
, 	s.[backup_start_date];
GO

In my SQL Server instance a subset of the results looks as follows:

database_name   backup_start_date         Sample Size   Backup Size (MB)   Previous Backup Size (MB)
...             ...                       ...           ...                ...
Analytics       2015-06-05 22:39:21.000   30            639002             NULL
Analytics       2015-06-06 22:43:16.000   30            639505             639002
Analytics       2015-06-07 22:45:48.000   30            637662             639505
...             ...                       ...           ...                ...
Operations      2015-06-06 02:08:09.000   30            9892               NULL
Operations      2015-06-07 04:36:35.000   30            9895               9892
Operations      2015-06-08 01:45:40.000   30            9916               9895
...             ...                       ...           ...                ...
Sales           2015-06-05 22:00:01.000   30            102695             NULL
Sales           2015-06-06 22:00:00.000   30            102813             102695
Sales           2015-06-07 22:00:02.000   30            102532             102813
...             ...                       ...           ...                ...

NULL values in the [Previous Backup Size (MB)] column reflect the fact that those are the very first records in the sequence for that database, so there is no "previous" value. A Sample Size of 30 means that there are 30 records of successful full backups for that database in backupset.
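
If those leading NULLs are not wanted, LAG also accepts optional offset and default-value arguments. As a sketch (not used in the scripts in this tip), substituting the current row's backup size when no previous row exists makes the first difference evaluate to zero instead of NULL:

...
--a sketch: LAG with explicit offset (1) and a default value for the first row in each partition
,	LAG( s.[backup_size], 1, s.[backup_size] ) 
		OVER ( PARTITION BY s.[database_name] ORDER BY s.[backup_start_date] ) AS [Previous Backup Size]
...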

Now that I have this intermediate set of data, I am ready to return the average, min and max differences between "current" and "previous" backup size. To restrict each result to data from a single database, I again use the OVER ( PARTITION BY ... ) construct, partitioning by database name.

Here is the final script:

-----------
--Script 3:
-----------
SELECT DISTINCT
	A.[database_name]
,	AVG( A.[Backup Size (MB)] - A.[Previous Backup Size (MB)] ) OVER ( PARTITION BY A.[database_name] ) AS [Avg Size Diff From Previous (MB)]
,	MAX( A.[Backup Size (MB)] - A.[Previous Backup Size (MB)] ) OVER ( PARTITION BY A.[database_name] ) AS [Max Size Diff From Previous (MB)]
,	MIN( A.[Backup Size (MB)] - A.[Previous Backup Size (MB)] ) OVER ( PARTITION BY A.[database_name] ) AS [Min Size Diff From Previous (MB)]
,	A.[Sample Size]
FROM 
(
	SELECT
		s.[database_name]
	--,	s.[backup_start_date]
	,	COUNT(*) OVER ( PARTITION BY s.[database_name] ) AS [Sample Size]
	,	CAST ( ( s.[backup_size] / 1024 / 1024 ) AS INT ) AS [Backup Size (MB)]
	,	CAST ( ( LAG(s.[backup_size] ) 
			OVER ( PARTITION BY s.[database_name] ORDER BY s.[backup_start_date] ) / 1024 / 1024 ) AS INT ) AS [Previous Backup Size (MB)]
	FROM 
		[msdb]..[backupset] s
	WHERE
		s.[type] = 'D' --full backup
	--ORDER BY
	--	s.[database_name]
	--,	s.[backup_start_date]
) AS A
ORDER BY
	[Avg Size Diff From Previous (MB)] DESC;
GO

Results:

database_name   Avg Size Diff From Previous (MB)   Max Size Diff From Previous (MB)   Min Size Diff From Previous (MB)   Sample Size
Analytics       561                                6016                               -5050                              30
Sales           49                                 955                                -371                               30
Operations      2                                  13                                 -2                                 30
...             ...                                ...                                ...                                ...

Over the retention time span of data in the backupset table (30 records of daily backups in this example), the Analytics database exhibits the highest average daily increase (growth rate) in backup size at 561 MB. Analytics is therefore the fastest-growing database in the SQL Server instance over the past 30 days. The Sales and Operations databases come second and third, respectively. If performance issues with the Analytics database had started creeping up recently, its large growth rate would be one of the clues to pursue further. I could now begin to dig deeper into specific tables within Analytics driving this growth, and would work with application developers to come up with archiving/pruning strategies to mitigate it in the long term.
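
As a starting point for that table-level investigation, a query along the following lines (a sketch using sys.dm_db_partition_stats, run in the context of the database being investigated) lists the largest user tables by reserved space; reserved_page_count is expressed in 8-KB pages:

--a sketch: largest user tables by reserved space (requires VIEW DATABASE STATE permission)
--run in the context of the database being investigated, e.g. USE [Analytics]
SELECT TOP (10)
	OBJECT_SCHEMA_NAME( ps.[object_id] ) + '.' + OBJECT_NAME( ps.[object_id] ) AS [Table]
,	SUM( ps.[reserved_page_count] ) * 8 / 1024 AS [Reserved (MB)]
,	SUM( CASE WHEN ps.[index_id] IN ( 0, 1 ) THEN ps.[row_count] ELSE 0 END ) AS [Row Count]
FROM
	sys.dm_db_partition_stats ps
WHERE
	OBJECTPROPERTY( ps.[object_id], 'IsMSShipped' ) = 0
GROUP BY
	ps.[object_id]
ORDER BY
	[Reserved (MB)] DESC;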

Conclusion

Proactively addressing issues with database growth not only helps rein in infrastructure costs, but can be crucial for optimal application performance. In this tip I have shown how to easily monitor growth by combining information from the msdb..backupset table with the LAG window function to come up with a snapshot of the fastest-growing databases in a SQL Server (2012 and higher) instance. Depending on the retention interval of backupset, one may wish to poll these results at scheduled intervals and persist them in a permanent table for analysis over longer periods of time.
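
As one possible approach (a minimal sketch; the [dbo].[DatabaseGrowthHistory] table name and a dedicated DBA database are assumptions, not part of this tip), Script 3 could be wrapped in an INSERT ... SELECT against a table like the one below and scheduled via a SQL Server Agent job:

--a sketch of a permanent table for persisting the output of Script 3 over time
CREATE TABLE [dbo].[DatabaseGrowthHistory]
(
	[capture_date]      DATETIME NOT NULL CONSTRAINT [DF_DatabaseGrowthHistory_capture_date] DEFAULT ( GETDATE() )
,	[database_name]     SYSNAME  NOT NULL
,	[avg_size_diff_mb]  INT      NULL
,	[max_size_diff_mb]  INT      NULL
,	[min_size_diff_mb]  INT      NULL
,	[sample_size]       INT      NULL
);
GO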

Next Steps





About the author
Marios Philippopoulos has been a SQL Server DBA for over 10 years. He is based in the Toronto area, Canada.

This author pledges the content of this article is based on professional experience and not AI generated.



Article Last Updated: 2015-08-14

Comments For This Article




Tuesday, September 1, 2015 - 6:04:52 PM - Marios Philippopoulos

Hi Jeff,

Thanks for the comments. I totally agree that it is much more preferable if one can collect data on individual file and/or table sizes over time.

This will give much more accurate predictions, especially in situations, as you mention, where the database data files span multiple volumes.

The approach described in the tip is intended for scenarios where such monitoring has not been set up - it is a quick and dirty way of getting information on database growth for systems where a drive has suddenly filled up, and one needs to find out which db(s) is/are the most likely culprit.

Again, not the ideal way of monitoring growth but a good way of getting some info "after the fact", when more detailed monitoring had not already been set up.

Marios


Wednesday, August 26, 2015 - 6:37:06 PM - jeff_yao

Nice article, but to me, capacity planning based on the database as a whole is usually not sufficient in most complex, big database environments. Two reasons here:

1. Big databases usually have multiple data files (esp. when you have partitions and filegroups for blob data) spread across different drives.

2. Many data files may serve as archives, i.e. read-only, thus no growth.

One scenario I experienced before:

One database of 1+ TB size (log file not counted) with 5 data files, of which 2 are active (one for data and another for indexes, each on a different drive). The index file (let's say it is 20 GB initially) almost tripled to 60 GB in 6 months, while the active data file (initially at 100 GB) doubled in the same period. If capacity planning uses the whole db size as a starting point, the final prediction will be skewed toward insufficient capacity in another 6 months. But if we use individual file sizes as a starting point, we will get more accurate predictions.

In the real world, I even go further to table-level size collection (esp. when some tables have blob columns and are the major contributors to data growth), and thus I can get even better capacity planning in the business context of these tables.














