SQL Server Clustered Columnstore Index Examples for ETL

Problem

We run multiple ETL environments all of which involve imports, aggregates, and data migrations of billions of records a day. In your webinar, you mentioned clustered columnstore indexes as a tool that some ETL environments may be able to use and since we’re using SQL Server versions 2014 and 2016, both of which have these features, we wanted to know what clustered columnstore indexes may be able to offer us and when we should consider them.

Solution

Columnstore indexes cover a full table and structure and manage data by column, as well as query the data by column. In contrast with how we typically think of a set (a set of rows), a comparable analogy with a set in this case would be that the column is the set. Some data within a column is repetitive, which allows for compression of those data; for an example, if we have a column with 10 billion records of integers that range from 1 to 101,000, we have very little distinct data. This allows for significant compression and due to the compression, lower resource usage – such as disk, memory, CPU, etc in some cases. Since we are thinking about a set of columns over a set of rows in this case, our queries should involve most or all of the records.

As new records are inserted into the table with a clustered columnstore index, the data are compressed – though there are times where a batch of records are not compressed up to a certain point. Vivek Anil, a Microsoft Engineer, pointed out that we could keep an eye on the sys.column_store_row_groups for administration purposes, as this provides us with information on the data compressed and the uncompressed data (delta_store_hobt_id shows the ID for data open in the delta store that have not been compressed). As data move in or are removed from the table, we may need to rebuild the index. A key difference here is that with a clustered columnstore index, we’ll be rebuilding one index, since it covers the table.

When To Consider Columnstore Indexes

When we have queries that use a significant portion of the overall data stored – such as queries that use 90-100% of the full table. Using an example with this, if we have a table with 1 billion records and most of your analytic queries use 900 million to 1 billion records, a columnstore index might be useful for these queries.
When memory, disk space or CPU use may be restricted, or may be a bottleneck for queries or analysis. Because columnstore indexes allow for significant compression, relative to the data type, the costs for storing these data and running processes analyzing these data may be reduced.
When we have similar data within a column, or data that compresses at a high ratio through experimentation. Since the data are stored and structured by column.
In ETL environments where we archive all of our data, which leads to significant storage costs. A clustered columnstore index will result in savings with storage, since the data are compressed. Provided that we seldom access these archived data, this may be a favorable option to indexes that use more space, or tables without any indexes that have no compression.

There are some drawbacks to clustered columnstore indexes:

Not all data types are supported: Unique identifier, varbinary/varchar/nvarchar max length, xml, sql_variant, ntext, image, text, timestamp and rowversion.
In testing speed and performance, there may be situations where these indexes will show poor results. A live report that’s refreshed by the second would be better served by other architecture, such as an in-memory table.

When To Pass On Columnstore Indexes

Outside of when data aren’t compatible or we’re using an incompatible version, some considerations when we may not want to use clustered columnstore indexes:

When queries use a small fraction of the table or involve single records over aggregates. An example would be a 1 billion row table where 99% of the queries against it involve no more than 2 records per query.
When testing invalidates the need: if storage or resources are not a bottleneck and queries are much faster without it in testing, there’s little need to use this type of indexing.

Examples

In this section, we’ll look at two examples – one of which will involve your environment and one of which is a simulated example. In your development environment, look for tables that involve large query analysis relative to their size, are large, and meet the criteria by only having data types supported. Next, follow the steps:

Create a copy of the table, either by the CREATE TABLE and INSERT INTO syntax or a SELECT * FROM YourLargeTable INTO YourLargeTableCopy syntax. The second is faster.
Add a clustered columnstore index to the copy table. You will have two tables: the original table with clustered and non-clustered indexes and the table with the clustered columnstore index.
Run the same queries that you run against your large table and compare the results in measurement of speed and resource usage by tracking details when run against both tables. Compare the execution plan, statistics IO, time, waits, and CPU/Memory/Disk usage. This will help you determine if a clustered columnstore index would be useful in your situation.

In the next step, we’ll look at the clustered columnstore index on a table we populate with 1 million random records. In the comments, we see how we generate larger random numbers and our table holds three values – a random date, and two random values, both of which are bigint. In our first test, we add two indexes to two columns and in the second test, we use a clustered columnstore index. This contrived example helps us compare, but we should consider that in many cases we cannot add indexes to a large percent of columns in big data ETL environments – some of these environments have 100s of columns per table and the storage cost (as we later see) would be impractical when we also consider the data storage cost. In other words, if our queries against our tables involve most or all of the records and we have 100s of columns, adding indexes to all the columns to involve index scans over table scans won’t be practical in most situations.

 CREATE TABLE CCITable (ValueOne BIGINT, ValueTwo BIGINT, ValueDate DATETIME)
/*
SELECT 
 NEWID()
 , --CHECKSUM(NEWID()) CheckSumMethod
 , --ABS(CHECKSUM(NEWID())) AbsoluteValueCheckSumMethod
 , --CAST(((RAND()*0.00001) * ABS(CHECKSUM(NEWID()))) AS INT) ---- RAND() * 0.00001 will reduce number size
SELECT @rand = CAST(((RAND()*0.00001) * ABS(CHECKSUM(NEWID()))) AS INT
*/
DECLARE @rand1 BIGINT, @rand2 BIGINT, @b INT = 1
WHILE @b <= 1000000
BEGIN
 SELECT @rand1 = CAST(((RAND()*0.00001) * ABS(CHECKSUM(NEWID()))) AS BIGINT)
 SELECT @rand2 = CAST(((RAND()*0.0001) * ABS(CHECKSUM(NEWID()))) AS BIGINT)
 
 INSERT INTO CCITable VALUES (@rand1,@rand2,DATEADD(DD,-(@rand1/10),GETDATE()))
 
 SET @b = @b + 1
END
SELECT TOP 10 *
FROM CCITable
CREATE NONCLUSTERED INDEX IX_CCITable_ValueDate ON CCITable (ValueDate)
CREATE NONCLUSTERED INDEX IX_CCITable_Value ON CCITable (ValueOne)
SELECT 
 MIN(ValueOne)
 , MAX(ValueOne)
 , AVG(ValueOne)
 , STDEV(ValueOne)
 , SUM(ValueOne)/(AVG(ValueOne))
 , MIN(ValueTwo)
 , MAX(ValueTwo)
 , AVG(ValueTwo)
 , STDEV(ValueTwo)
FROM CCITable
DROP INDEX IX_CCITable_ValueDate ON CCITable
DROP INDEX IX_CCITable_Value ON CCITable
CREATE CLUSTERED COLUMNSTORE INDEX CCI_CCITable ON CCITable
SELECT 
 MIN(ValueOne)
 , MAX(ValueOne)
 , AVG(ValueOne)
 , STDEV(ValueOne)
 , SUM(ValueOne)/(AVG(ValueOne))
 , MIN(ValueTwo)
 , MAX(ValueTwo)
 , AVG(ValueTwo)
 , STDEV(ValueTwo)
FROM CCITable
DROP TABLE CCITable

Table ‘CCITable’. Scan count 1, logical reads 4086, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘CCITable’. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 1121, lob physical reads 2, lob read-ahead reads 3428.

Depending on our needs, the clustered columnstore index may offer more advantages than the clustered and nonclustered index options in a big data ETL environment.

Next Steps

Consider the types of queries that you run in your ETL environment: queries that involve the table, queries that involve a fraction of the table, or both? This will help you decide what route to choose.
Consider your bottlenecks and queries: some queries may benefit from clustered or non-clustered indexes since their SARGability can minimize the amount of records returned. In some situations, space can be an issue so even if a table with a clustered columnstore index is slower with some queries, the reduced size makes the preferred index to use.
Clustered columnstore indexes are avaiable in versions SQL Server 2014 and beyond.

Tim Smith

Tim consults for FinTek Development and teaches the course Automating ETL on Udemy. He’s worked with technology since high school, helping his school win its first TCEA award and continues to work in automation, data architecture, back-end development, and smart contract architecture. Tim enjoys testing new technologies early in the diffusion of innovation curve and was an early adopter of NoSQL and smart contract development. He has a blog at http://www.fintekdev.com/ and helps contribute to local technical and financial events in Texas.

When To Pass On Columnstore Indexes

Examples

Leave a ReplyCancel Reply