Unicode Compression in SQL Server 2008 R2


By: Aaron Bertrand   |   Last Updated: 2010-05-28   |   More > Compression

Problem

Databases are getting bigger. You might argue that hard disks are getting bigger also; however (and putting Solid State Drives out of the equation for the moment) they are not getting faster. In fact most systems today, even those on big expensive SANs and other highly scalable disk subsystems, are I/O-bound rather than CPU- or memory-bound. To compound the problem, we have to support Internationalization -- meaning we need to use NVARCHAR columns to support Unicode data, even if most of our column values only contain ASCII characters -- leading to much wasted space and, more importantly, I/O that is much heavier than necessary.

Solution

Using data compression in SQL Server 2008 R2, not only can we benefit from row (storing fixed-length data as variable-length) and page (adding prefix and dictionary) compression, but we can also observe that ASCII characters will only occupy a single byte even in NCHAR and NVARCHAR columns - only those characters that require two bytes will actually use them. On systems where Unicode data is the exception rather than the rule, this can represent a significant and dramatic difference in both storage and performance.
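For an existing table, row or page compression (and, with either, Unicode compression — it has no separate switch) is enabled with a rebuild. A minimal sketch, assuming a hypothetical table named dbo.Existing:

```sql
-- Page compression: row compression plus prefix/dictionary
-- compression, plus Unicode compression for N(VAR)CHAR columns:
ALTER TABLE dbo.Existing REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Or row compression alone (also includes Unicode compression):
ALTER TABLE dbo.Existing REBUILD WITH (DATA_COMPRESSION = ROW);
```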

Using a very simple example for illustration, let's say we have the following table:

CREATE TABLE dbo.Uncompressed
(
 data NVARCHAR(255)
);

In order to test data compression, I'll need to populate that table with realistic data. Typically I'll use catalog views such as sys.tables or sys.columns to populate dummy data; in this case, I happened to have a nice round number of rows in sys.objects (60), so in order to create a sample case where 25% of the data contained Unicode characters, I created 20 additional tables with names like:

CREATE TABLE dbo.[бар](a INT); -- table names contain non-ASCII (Unicode) characters
CREATE TABLE dbo.[яя](a INT); 
CREATE TABLE dbo.[ёё](a INT); 
...

Then to ensure that the data would be inserted into both my compressed and uncompressed tables in the same order (so that both tables would be storing the exact same data), I populated the following #temp table with 512,000 rows (using MAXDOP 1 to prevent parallelism from changing the sorting and identity assignment):

CREATE TABLE #tempdata
(
 i INT IDENTITY(1,1),
 data NVARCHAR(255)
);
INSERT #tempdata(data) 
 SELECT s1.name
 FROM sys.objects AS s1
 CROSS JOIN sys.objects AS s2
 CROSS JOIN sys.objects AS s3
 ORDER BY NEWID()
 OPTION (MAXDOP 1);
GO

Now I could populate my uncompressed table, and run a preliminary test against it:

INSERT dbo.Uncompressed(data) 
 SELECT data FROM #tempdata
 ORDER BY i
 OPTION (MAXDOP 1);
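The per-statement timings quoted below were captured with SET STATISTICS TIME; as a sketch, you simply wrap the statement under test:

```sql
-- CPU time and elapsed time for the wrapped statement
-- are reported in the Messages tab:
SET STATISTICS TIME ON;
INSERT dbo.Uncompressed(data) 
 SELECT data FROM #tempdata
 ORDER BY i
 OPTION (MAXDOP 1);
SET STATISTICS TIME OFF;
```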

My only real tests would be: how long does this take, and what kind of compression does the engine predict I will be able to achieve on this table? The INSERT took an elapsed time of 2.899 seconds (1.594 seconds of that was CPU time). You can estimate the compression ratio by using the system stored procedure sys.sp_estimate_data_compression_savings:

EXEC sys.sp_estimate_data_compression_savings 
  @schema_name      = 'dbo',
  @object_name      = 'Uncompressed',
  @index_id         =  0,
  @partition_number =  1,
  @data_compression =  'PAGE';

The results indicated that I should get a compression ratio of about 64%:


The question is, since the stored procedure bases its estimates on a small sample, would I get that same ratio when I actually applied compression? Well there is an easy way to test; I could create a second table, and then insert the same data:

CREATE TABLE dbo.Compressed
(
 data NVARCHAR(255)
)
WITH (DATA_COMPRESSION = PAGE);
GO
INSERT dbo.Compressed(data) 
 SELECT data FROM #tempdata
 ORDER BY i
 OPTION (MAXDOP 1);

As you might expect, this operation took longer; total elapsed time was 5.291 seconds (2.062 seconds of CPU time). We also find that the space we saved was not quite as pronounced as the system procedure estimated; the actual space savings in this case came out to about 40%. I checked this by running the following query:

SELECT [Table] = OBJECT_NAME([object_id]),
 [Pages] = reserved_page_count,
 [Size in KB] = reserved_page_count*8
 FROM sys.dm_db_partition_stats
 WHERE OBJECT_NAME([object_id]) IN ('Compressed', 'Uncompressed')
 AND index_id = 0;

With the following results:

(Results: the Compressed table occupies roughly 40% fewer pages than the Uncompressed table.)

So far we see that we can save a dramatic amount of space, but write times are adversely affected. What about read times? For brevity, let's just test a full scan of the table, and see how much longer it takes to decompress and read the compressed data, vs. reading the uncompressed data directly:

SET STATISTICS TIME ON;
SELECT [data] FROM dbo.Compressed;
SELECT [data] FROM dbo.Uncompressed;
SET STATISTICS TIME OFF;

The results:

dbo.Compressed:
 SQL Server Execution Times:
   CPU time = 484 ms,  elapsed time = 4015 ms.

dbo.Uncompressed:
 SQL Server Execution Times:
   CPU time = 312 ms,  elapsed time = 3943 ms.

So, there is a significant CPU cost to reading and decompressing the compressed data, but due to lower I/O (~1,300 logical reads vs. ~2,200), it does not impact overall duration as much as you might expect. Keep in mind that this was an isolated and virtualized system with a suboptimal I/O subsystem; in better-tuned environments, the duration may very well swing the other way.
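The logical read figures above can be observed by adding SET STATISTICS IO alongside STATISTICS TIME; for example:

```sql
-- Logical reads for each scan appear in the Messages tab;
-- the compressed table spans significantly fewer pages:
SET STATISTICS IO ON;
SELECT [data] FROM dbo.Compressed;
SELECT [data] FROM dbo.Uncompressed;
SET STATISTICS IO OFF;
```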


All of which just reminds us that nothing is free, and compression / decompression will certainly take a CPU toll on certain operations. But on systems where you're already I/O-bound, you may be better off paying a CPU performance hit in order to reduce I/O. In addition, since data is kept compressed in the buffer pool (and only decompressed when it is read), you will find that your memory requirements will decrease as well, meaning you will be able to get more data into memory without actually adding RAM. When asking yourself if data compression is right for you, just keep in mind this important point: As with all architectural changes, "it depends" - you will need to gauge the feasibility of this feature, given your hardware, data, workload and usage patterns.
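One way to observe the buffer pool effect is to count how many of each table's pages are currently in memory. A sketch using sys.dm_os_buffer_descriptors, limited to in-row and row-overflow allocation units:

```sql
SELECT [Table] = OBJECT_NAME(p.[object_id]),
 [Pages in Buffer Pool] = COUNT(*)
 FROM sys.dm_os_buffer_descriptors AS b
 INNER JOIN sys.allocation_units AS a
 ON b.allocation_unit_id = a.allocation_unit_id
 INNER JOIN sys.partitions AS p
 ON a.container_id = p.hobt_id
 WHERE b.database_id = DB_ID()
 AND a.[type] IN (1, 3) -- IN_ROW_DATA / ROW_OVERFLOW_DATA
 AND p.[object_id] IN (OBJECT_ID('dbo.Compressed'), OBJECT_ID('dbo.Uncompressed'))
 GROUP BY p.[object_id];
```

Since compressed pages stay compressed in the buffer pool, the compressed table should account for proportionally fewer in-memory pages after a full scan of each table.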

Note that there are two important limitations of Unicode compression that I feel cannot be ignored. One is that it does not support the NVARCHAR(MAX) data type; only NCHAR and NVARCHAR <= 4000. Another is that you cannot enable page or row compression without also enabling Unicode compression (it does not have its own setting).
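You can confirm which compression setting a table actually carries by querying sys.partitions:

```sql
SELECT [Table] = OBJECT_NAME([object_id]),
 index_id,
 data_compression_desc -- NONE, ROW or PAGE
 FROM sys.partitions
 WHERE [object_id] IN (OBJECT_ID('dbo.Compressed'), OBJECT_ID('dbo.Uncompressed'));
```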

About the author
Aaron Bertrand (@AaronBertrand) is a Product Manager at SentryOne, with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and serves as a community moderator for the Database Administrators Stack Exchange.





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Learn more about SQL Server tools