Demonstrating the Effects of Using Data Compression in SQL Server



Problem

You are a DBA or developer mulling over data compression options in SQL Server. You want to know how much space compressing a particular table or index will save, and how compression will affect query performance.

Solution

This article goes beyond some of the excellent descriptions of row- and page-level data compression options in SQL Server literature and demonstrates the effects of applying these options to moderately-sized data sets. For the demonstrations and to help benchmark your own experiments, I am using SQL Server 2012 RTM Developer Edition on Windows 7 Professional 64-bit, 4 cores @ 2.4GHz, memory capped in SQL Server to a maximum of 2GB for the instance.
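
If you want to mirror that memory cap when benchmarking your own environment, a minimal sketch using sp_configure is below (my addition - the 2048 MB figure matches the limit described above; adjust it to suit your server):

-- Cap the instance's maximum server memory at 2GB to match the test environment (adjust as required)
EXEC sp_configure 'show advanced options', 1 
RECONFIGURE
GO
EXEC sp_configure 'max server memory (MB)', 2048 
RECONFIGURE
GO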

Setup

For the setup, I am using a table of two million rows with two columns - an ID column (identity) and a fixed-length CHAR(100) column. The CHAR column contains the letter 'A' repeated 50 times (padded out to 100 characters by the fixed-length data type). The millionth row, ordered by UQID, contains the word 'Bazinga!' instead of 'AAAA...AAA'. The benchmark is to measure how quickly 'Bazinga!' is found, and how much I/O is involved, in a table scan when the table is not compressed, when it is compressed with row-level compression, and when it is compressed with page-level compression.

Let us first set up the environment. Create a database called SANDBOX (or modify the code to use your preferred test environment) and issue this:

USE MASTER 
GO
-- Keep the transaction log in check during the bulk insert
ALTER DATABASE SANDBOX SET RECOVERY SIMPLE
GO
CHECKPOINT 
USE SANDBOX 
SET NOCOUNT ON 
GO
-- Drop and recreate the test table
IF EXISTS ( SELECT t.name 
   FROM sys.tables t 
   INNER JOIN sys.schemas s ON t.schema_id = s.schema_id 
   WHERE t.name = 'TestData' AND s.name = 'dbo' ) 
 DROP TABLE dbo.TestData 
CREATE TABLE dbo.TestData ( 
 UQID INT IDENTITY(1,1) NOT NULL, 
 SomeCharData CHAR(100) ) 
GO
-- GO <n> repeats the preceding batch n times: 999,999 filler rows first...
INSERT INTO dbo.TestData 
 SELECT REPLICATE('A', 50) 
GO 999999 
-- ...then 'Bazinga!' on the millionth row...
INSERT INTO dbo.TestData 
 SELECT 'Bazinga!' 
GO
-- ...then another million filler rows
INSERT INTO dbo.TestData 
 SELECT REPLICATE('A', 50) 
GO 1000000

This will create and populate the table and will take around ten minutes to run, depending on your system specification. You can check on its progress in a separate query window by running:

SELECT COUNT(*) FROM SANDBOX.dbo.TestData WITH (NOLOCK)

Next, clear the procedure cache and drop clean memory buffers to 'clear the decks' ready for an unbiased test:

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS 
GO

Experimentation - No Compression

I then issued the following query to retrieve the UQID of the row that has the value 'Bazinga!' in the SomeCharData column.

SELECT UQID 
FROM dbo.TestData 
WHERE SomeCharData = 'Bazinga!'
GO

Here's the execution plan. You can see there are three elements - the table scan, the parallelism (gather streams) operator, and the SELECT.


Now, I don't particularly want parallelism in this case - as I found earlier when experimenting, parallelism will be invoked when the optimizer decides that a parallel query will run faster than a serial query and that the query exceeds the cost threshold for parallelism. This could skew my results if parallelism is invoked for one type of compression but not another (as I observed during testing). Therefore, I'm going to limit the query to one processor by overriding the optimizer with a query hint:

SELECT UQID 
FROM dbo.TestData 
WHERE SomeCharData = 'Bazinga!' 
OPTION (MAXDOP 1)
GO


This yields a straightforward execution plan with just the table scan and the SELECT. To keep these tests fair and ensure a reasonable statistical sample, I ran the test 15 times, clearing the caches each time, and collected the results using SET STATISTICS IO ON and SET STATISTICS TIME ON.

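For reference, each run was captured along these lines (a sketch of the pattern rather than a verbatim harness - the statistics output appears in the Messages tab):

-- Capture I/O and timing statistics for a single test run
SET STATISTICS IO ON 
SET STATISTICS TIME ON 
GO
SELECT UQID 
FROM dbo.TestData 
WHERE SomeCharData = 'Bazinga!' 
OPTION (MAXDOP 1)
GO
SET STATISTICS IO OFF 
SET STATISTICS TIME OFF 
GO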

The median of these figures is shown below. I chose the median because the range within each set of values was quite tight and I didn't want outliers (or decimal places) skewing the result:


So it's not a bad result, overall: 1.217 seconds of CPU time and 3.38 seconds elapsed to parse and compile the query, then 0.624 seconds of CPU time and 2.05 seconds elapsed to execute it - a total of 5.43 elapsed seconds. Let's put this on a graph.


Now, let's rebuild the table to use row-level compression.

Refresher - Compression Options

Just a quick refresher now for anyone not immediately familiar with compression - I shall attempt to do it justice. Data compression comes in two forms in SQL Server 2012 (technically three, counting Unicode compression, but that isn't a separately configurable option) - row-level compression and page-level compression. Row-level compression reduces the size of each row by storing fixed-length data types in a variable-length format - for example, a fixed-length CHAR is stored much as a VARCHAR would be, with the trailing padding dropped. Page-level compression includes row-level compression and adds two further techniques - prefix compression (factoring out common value prefixes within each column on a page) and dictionary compression (storing values repeated anywhere on the page once and referencing them). Compression can be specified as part of the table or index definition, or applied to an existing object with a REBUILD statement.
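
To illustrate the syntax only (a sketch - the table and index names here are placeholders, not objects used elsewhere in this tip):

-- Declare compression as part of the table definition...
CREATE TABLE dbo.SomeTable ( 
 SomeID INT IDENTITY(1,1) NOT NULL, 
 SomeCharData CHAR(100) ) 
 WITH ( DATA_COMPRESSION = PAGE ) 
GO
-- ...or apply it to an existing table or index with a REBUILD
ALTER TABLE dbo.SomeTable REBUILD WITH ( DATA_COMPRESSION = ROW ) 
GO
-- (assumes a nonclustered index of this name exists on the table)
ALTER INDEX IX_SomeTable_SomeCharData ON dbo.SomeTable REBUILD WITH ( DATA_COMPRESSION = PAGE ) 
GO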

Experimentation - Row-Level Compression

Okay, let's move on and rebuild the table with row-level compression. This is surprisingly simple to do. I will rebuild the table and also clear the caches again following our last query (to remove the plan from cache, which would otherwise bias the results):

ALTER TABLE dbo.TestData REBUILD WITH ( DATA_COMPRESSION = ROW ) 
GO
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS 
GO
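
If you also want to quantify the space the rebuild reclaimed, a quick before-and-after check with sp_spaceused does the job - a sketch (my addition, not part of the original test script):

-- Report reserved, data and index sizes for the test table; run before and after each rebuild
EXEC sp_spaceused @objname = 'dbo.TestData', @updateusage = 'true' 
GO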

The rebuild itself should only take a few seconds to run. Let's now look at the execution plan for the query to check it's the same as when using no compression at all:


Let's compare the properties of each element of the execution plan from the uncompressed run with those from the row-compressed run. Do the properties (and these are actual execution plans, as opposed to estimated ones) show any different values now that the compression setting differs?


This is interesting - we can already see from the execution plans that there's likely to be a drop in I/O consumption from using row-level compression. This would make sense - if there's less physical data to read, there's less I/O. I ran this experiment 15 times - here are the results:


And here are the median results, alongside the non-compressed results:


Here's the clustered bar graph, for visual comparison:


This is really interesting - we can see that the CPU time to parse and compile increased slightly, while the actual elapsed time decreased. Logical reads came down significantly - by nearly 50%. These savings all translate into shorter query execution time at a slight cost in CPU time. We'll see how this compares to page-level compression.

Experimentation - Page-Level Compression

Let's move on to the final stage in our experimentation. Let's once again rebuild the table, this time with page-level compression, and clear the caches:

ALTER TABLE dbo.TestData REBUILD WITH ( DATA_COMPRESSION = PAGE ) 
GO
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS 
GO
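
To confirm the heap really is page-compressed after the rebuild, the partition metadata can be checked - a small sketch (my addition, not part of the original test script):

-- Show the compression setting in force for each partition of the test table
SELECT OBJECT_NAME(p.object_id) AS table_name, 
       p.index_id, 
       p.partition_number, 
       p.data_compression_desc 
FROM sys.partitions p 
WHERE p.object_id = OBJECT_ID('dbo.TestData') 
GO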

Let's remind ourselves of the query - you'll note I've not overridden the parallelism this time; you'll see why in a moment:

SELECT UQID 
FROM dbo.TestData 
WHERE SomeCharData = 'Bazinga!' 
GO

Now we can look at the execution plan for the query and see if it remains the same (on the surface):


It does - and without overriding parallelism, it chooses serial execution anyway. Why? I would speculate this is because the default cost threshold for parallelism is set to 5, and I haven't changed it in my installation of SQL Server. Let's check the properties - we'll put them side-by-side with the row-compression properties, as we did with no compression versus row compression earlier, to see how they compare. I've circled the interesting stats, including the operator cost, which is indeed under 5:

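As an aside, if you want to verify the cost threshold for parallelism on your own instance, a quick look at sys.configurations confirms it (a sketch; the default is 5 unless someone has changed it):

-- Check the instance-wide cost threshold for parallelism
SELECT name, value_in_use 
FROM sys.configurations 
WHERE name = 'cost threshold for parallelism' 
GO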

For this reason, we'll execute the query without overriding parallelism 15 times as before, and examine the results. The properties above seem to imply we'll use even LESS I/O than before (see circled values), so let's see if this bears out. Here are the results:


And here are the median values, compared to the no-compression and row-compression median statistics. As you can see, CPU time has risen significantly on parse and compile - nearly four times the CPU used with no compression at all, and more than three times that of row compression - but with a slight reduction in CPU during execution. But look at the reduced I/O: more than a 90% improvement over not using compression at all (bear in mind my data was all 'A's - your mileage may vary) and a huge improvement over row-level compression. It's interesting that row-level compression couldn't achieve better savings, but as we know, the prefix and dictionary compression techniques apply only to page-level compression, so the row-level savings likely came from storing the fixed-length CHAR as variable-length data and reclaiming the roughly 50% of each value that was padding.


Here's the bar graph to see how it all compares:


Using Estimations

You'll be glad to know that to estimate the savings from row- and page-level compression, you don't need to go through all this (although it's thorough if you do!). There's a handy system stored procedure you can use, supplied as part of SQL Server, called sp_estimate_data_compression_savings. Call it like this:

-- @index_id = 0 refers to the heap; use the relevant index_id from sys.indexes for a specific index
EXEC sp_estimate_data_compression_savings 
 @schema_name = 'dbo', 
 @object_name = 'TestData', 
 @index_id = 0, 
 @partition_number = 1, 
 @data_compression = 'row' 
GO

Here are the results of row-level estimations. As you can see, the space estimations are extremely close to those values actually observed (there is a small difference):

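The page-level estimate comes from the same call with the @data_compression parameter switched to 'page':

-- Estimate savings for page-level compression on the same heap
EXEC sp_estimate_data_compression_savings 
 @schema_name = 'dbo', 
 @object_name = 'TestData', 
 @index_id = 0, 
 @partition_number = 1, 
 @data_compression = 'page' 
GO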

And here are the results of page-level estimations - again, the estimated space saving is very close to what we actually observed:


Be careful, though - what this stored procedure DOESN'T tell you is that CPU use will increase significantly - as we saw, by nearly a factor of four between no compression and page compression! Use this stored procedure, but conduct your own testing too.

Conclusion

As we can see, row- and page-level compression can be powerful tools to help you reduce the space taken by your data and improve execution speed, but at the expense of CPU time. This is because each access to a compressed row or page requires a step to undo the compression (or to reconstruct values from the page's prefix and dictionary structures), and this translates directly into compute time. So, when considering row- or page-level compression, conduct some similar testing (you are welcome to use my framework!) and see how it plays out in your test environment. Your results should inform your decision - if you're already CPU-bound, can you afford to deploy this? If your storage is on fire, can you afford NOT to?

Next Steps

You can find some excellent further reading using the resources below:






About the author
Derek Colley is a UK-based DBA and BI Developer with more than a decade of experience working with SQL Server, Oracle and MySQL.





Comments For This Article




Monday, July 17, 2017 - 5:07:17 PM - eric81

What about if I have clustered and non-clustered indexes on the table I'm looking to compress? When should I look at compressing my indexes?


Monday, October 19, 2015 - 5:06:42 AM - Cees Meuleman

Based on the Microsoft TechNet article below I have written a stored procedure which will evaluate the use of the tables and suggest whether to use row or page compression. By default it will generate the commands based on the suggestion, but alternatively you can also execute the commands directly. Furthermore, you can adjust the parameters on which the decision is based.

This is the first version of this script, so please let me know if you run into issues.

Microsoft technote:
https://technet.microsoft.com/en-us/library/dd894051(v=sql.100).aspx

 

Download stored procedure:
http://tinyurl.com/pf6ol77


Thursday, July 9, 2015 - 4:01:50 PM - JK

This article was really helpful and I have one question here. After page- or row-level compression, in case we want to update a record and the new data size/text is a little bit larger than before, what happens? Will the data be stored in the same page, or will a new extent be allocated? And will it affect performance?

Just planning to implement this and want to know whether page- or row-level compression will affect performance (after updates to the data).

Thanks

JK


Thursday, April 3, 2014 - 2:56:15 AM - sri

So the preference should be row compression on an OLTP database, which needs system resources for many concurrent transactions, where you have I/O overhead and need to improve I/O latency.

OLAP databases would be good candidates for page compression.


Sunday, March 30, 2014 - 6:45:57 AM - khalilpak

Nicely described.


Thursday, March 27, 2014 - 7:38:51 AM - Perry Whittle

There are various whitepapers from Microsoft and detail in Books Online around compression. Anything that incurs write activity to a certain degree may not make a good candidate for compression. This link details it: http://technet.microsoft.com/en-us/library/dd894051(v=sql.100).aspx

A more detailed approach to deciding what to compress involves analyzing the workload characteristics for each table and index. It is based on the following two metrics:

  • U: The percentage of update operations on a specific table, index, or partition, relative to total operations on that object. The lower the value of U (that is, the table, index, or partition is infrequently updated), the better candidate it is for page compression.
  • S: The percentage of scan operations on a table, index, or partition, relative to total operations on that object. The higher the value of S (that is, the table, index, or partition is mostly scanned), the better candidate it is for page compression.

regards Perry

 


Wednesday, March 26, 2014 - 7:45:44 PM - SQL_ME_RICH

Just out of curiosity - the page level compression results also show that the Optimization Level is 'Trivial' while row compression and none at all show it being Full. Any thoughts on why that would be with the very same query?

Also - I did all of these tests on my local DEV 2008 R2 instance, and made 3 databases (one with no compression, one with row compression and one with page compression), and even gave the UQID column a Primary Key with a Clustered Index. Results are all the same.

Now I just need to figure out which of my tables has a higher percentage of reads to writes, and see if anything comes of Perry's comments.  Thank you for this research!  It helps us little guys out that have no budget for new hardware, but are expected to make magic happen anyway.


Tuesday, March 25, 2014 - 1:28:15 PM - Derek Colley

@Perry: Hi Perry, thanks for your feedback. Fair comment - this article doesn't test write performance on these sorts of tables. Arguably, though, in OLTP applications there's a definite bias towards reads, and read performance would take priority over write performance. SQL Server writes synchronously to pages in the buffer cache and then to the transaction log in any case, so I'm not sure if write performance to the table would be impacted, since log flushes occur asynchronously. Perhaps only if there is high simultaneous read load on the same set of tables. I would speculate that there is still a delay, as you say, from the overhead of compressing the data, which may manifest as high CPU.

Do you have any specific links/research topics about write performance on compressed tables that you've read that I could follow up (leaving aside top Google results for now)? It would make a good extension to this article.

Thanks

Derek.


Tuesday, March 25, 2014 - 12:04:39 PM - Perry Whittle

It's all well and good for reads; it's when you come to writing into these tables that the real I/O hits are encountered. The CPU and disk I/O involved to read, uncompress, recompress and write to disk are high and can be a huge performance killer.

Just something to bear in mind

Regards Perry


Thursday, March 13, 2014 - 9:24:01 AM - Derek Colley

Thanks everyone for your feedback! :)


Wednesday, March 12, 2014 - 7:51:19 PM - Rick Willemain

Very nicely presented and features demonstrated.

Thank you!


Wednesday, March 12, 2014 - 7:32:00 PM - Tablescan Dan

Excellent post!  It mirrors my testing and production behavior.  We were having I/O issues (as well as a heavy load from BMC Patrol - ugh), and the targeted compression of large, demographic-related tables really helped.


Wednesday, March 12, 2014 - 9:50:44 AM - Jason

Right on. Excellent.














