SQL Server Ordered Columnstore Indexes for Data Aggregation in Data Warehouses

By:   |   Updated: 2023-06-19   |   Comments   |   Related: > Indexing


Problem

Do you remember your favorite birthday present? Mine was getting a Super Nintendo. I still occasionally play Super Mario World, the best video game ever created. The best gift Microsoft gave me was columnstore, one of my favorite features introduced in SQL Server. The product got even better when Microsoft released SQL Server 2016 with the ability to create an updatable nonclustered variety. Something always felt lacking: the ability to order when creating the index. Sure, there's a workaround to achieving such an order, but it seemed like something was missing. What's the primary reason you want the ability to order? You can boil it down to achieving segment elimination. When Microsoft revealed SQL Server 2022 offers the ability to order a columnstore index, I thought, finally. However, is it everything I hoped for?

Solution

In this tutorial, we'll briefly look at what a columnstore index is and when you would want to use one. I'll touch on the internal storage mechanism and how chunks of columns are coded and compressed into segments. Like standard data pages in SQL Server, you want to return as few as possible. It's like having a choice in school to write 5 or 10 pages for a report. Which one are you going to choose? Before SQL Server 2022, you needed to load data in a particular fashion to eliminate a segment. With the latest and greatest, there is an option to order a columnstore index on one or more columns.

What Is Columnstore?

Columnstore is a type of record storage where columns are saved to data pages using a columnar format versus entire rows. An easy way to think about it is rowstore saves records horizontally, and columnstore saves them vertically. Most often, columnstore is associated with a data warehouse type of environment. You even see entire database platforms where columnstore is the focus, such as Snowflake, Casandra, and Azure Synapse.

Microsoft first introduced columnstore in SQL Server 2012. However, 2016 SP1 provided the ability to create an updatable nonclustered index. This feature added a massive improvement since it meant you could create a columnstore index on a table with a clustered B+ tree index. Finally, having a hybrid OLTP and data warehousing environment was a reality.

What's Columnstore Good For?

Columnstore shines when it comes to aggregating data over lots of rows. When I say lots, I'm talking tens, if not 100s of millions or even billions. It's also ideal if the column's values repeat. For example, a column with an integer representing the 50 states in the US over a table of 100 million would be repeated millions of times. This is great for allowing the compression mechanism to work. You'll see no noticeable benefit if you create a columnstore index on a table under two to three million rows. On the contrary, it will just add overhead. However, when tables are large enough, and you rely on real-time data for analytical reports, you should try columnstore.

Building Our Dataset

Let's assemble a decent size dataset where adding a columnstore index makes sense. The code below creates a database with one table containing 100 million rows. I know it's not the 100 billion Dr. Evil asked for, but it's a start. On my machine, this took about five minutes to run.

USE [master];
GO

-- Checking to see if our database exists and if it does drop it.
IF DATABASEPROPERTYEX(   'BigDatabase',
                         'Version'
                     ) IS NOT NULL
BEGIN
    ALTER DATABASE BigDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE BigDatabase;
END;
GO

-- Make sure you have at least 20GB to follow along.
CREATE DATABASE BigDatabase
ON PRIMARY
       (
           NAME = N'BigDatabase',
           FILENAME = N'C:\code\MSSQLTips\BigDatabase2022.mdf',
           SIZE = 10000000KB,
           FILEGROWTH = 500000KB
       )
LOG ON
    (
        NAME = N'BigDatabase_log',
        FILENAME = N'C:\code\MSSQLTips\BigDatabase2022_log.ldf',
        SIZE = 3500000KB,
        FILEGROWTH = 1000000KB
    );
GO


ALTER DATABASE BigDatabase SET RECOVERY SIMPLE;
GO

USE BigDatabase;
GO

CREATE TABLE dbo.SalesTrackingHistory
(
    Id INT IDENTITY(1, 1),
    SalesCode INT NOT NULL,
    CustomerId INT NOT NULL,
    OrderId INT NOT NULL,
    TrackingId INT NOT NULL,
    TrackingCount INT NOT NULL,
    SalesDate DATE NOT NULL
);
GO

SELECT TOP (10000000)
       Number = CONVERT(   INT,
                           ROW_NUMBER() OVER (ORDER BY s1.object_id)
                       )
INTO dbo.Numbers
FROM sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3;
GO
CREATE UNIQUE CLUSTERED INDEX CIX_Number
ON dbo.Numbers ([Number])
WITH (FILLFACTOR = 100);


DECLARE @RunCount INT = 0;
DECLARE @RowCount INT = 0;

WHILE @RunCount < 10
BEGIN
    INSERT INTO dbo.SalesTrackingHistory
    (
        SalesCode,
        CustomerId,
        OrderId,
        TrackingId,
        TrackingCount,
        SalesDate
    )
    SELECT TOP 10000000
           ABS(CHECKSUM(NEWID()) % 1000) + 10 AS SalesCode,
           ABS(CHECKSUM(NEWID()) % 2000) + 20 AS CustomerId,
           ABS(CHECKSUM(NEWID()) % 3000) + 30 AS OrderId,
           ABS(CHECKSUM(NEWID()) % 4000) + 40 AS TrackingId,
           ABS(CHECKSUM(NEWID()) % 100000) + 1000 AS TrackingCount,
           DATEADD(   DAY,
                      RAND(CHECKSUM(NEWID())) * (1 + DATEDIFF(   DAY,
                                                                 '01/01/2016',
                                                                 '08/31/2023'
                                                             )
                                                ),
                      '01/01/2016'
                  ) AS SalesDate
    FROM dbo.Numbers AS n;

    SET @RowCount = @RowCount + @@ROWCOUNT;
    SET @RunCount = @RunCount + 1;

    CHECKPOINT;

    RAISERROR(   '%i Million Rows Inserted, Almost There!',
                 10,
                 1
             ,@RowCount
             ) WITH NOWAIT;

END;
GO

A Closer Look at Storage

As mentioned above, columnstore uses the columnar data format, saving columns together. SQL Server keeps groups of around 1 million rows in rowgroups. The rowgroups consist of all the columns from the columnstore index. Once the rowgroups meet the qualifications for compression, SQL breaks them into column chunks called segments. When off-row, SQL stores segments in the same location as XML data or NVARCHAR(MAX).

An interesting fact about segments is that SQL stores the min and max values in each one. For example, let's say your segment contains dates. If the minimum value is 01-01-2023 and the maximum is 01-31-2023, those values are imprinted on the segments. These values are critical when it comes to segment elimination.

Why Order Matters

Suppose I asked you what's a surefire way to reduce the number of pages SQL returns in a query. Your answer might be to add an index and place a WHERE clause in the statement. When SQL orders pages, it knows where specific values reside. To eliminate segments, we need to follow the same guidelines. By default, a columnstore index isn't ordered. That's the main problem in trying to achieve segment elimination. There's a workaround to ensure ordering, at least when starting, which I've used many times. The first statement below creates a rowstore index, then I create a columnstore index and drop the original. Microsoft doesn't officially support this, but it works well enough.

CREATE NONCLUSTERED INDEX NCCSI_SalesTrackingHistory
ON dbo.SalesTrackingHistory (
                                SalesDate,
                                CustomerId
                            )
WITH (MAXDOP = 1);

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCSI_SalesTrackingHistory
ON dbo.SalesTrackingHistory (
                                SalesDate,
                                CustomerId
                            )
WITH (DROP_EXISTING = ON, MAXDOP = 1);

When you order segments, SQL eliminates ones that don't contain the data we define in the WHERE clause. This feature allows SQL to return less data, which is always good unless your goal is to make queries run slower. Now let's look at a feature in SQL Server 2022 that allows you to order your columnstore index.

SQL Server 2022 Ordered Columnstore Index

SQL Server 2022 offers an ordered columnstore index. You can even include multiple columns, like a rowstore index. Like those late-night infomercials, it seems too good to be true. Let's look at the syntax:

CREATE CLUSTERED COLUMNSTORE INDEX CCSI_SalesTrackingHistory
ON dbo.SalesTrackingHistory ORDER(SalesDate)
WITH (MAXDOP = 1);

You need to be on SQL Server 2022 and set your database compatibility level to 160 for this to work. Now that we have created our ordered columnstore index, let's run a statement and see if SQL skips any segments.

SET STATISTICS IO ON;
SELECT SUM(CAST(TrackingCount AS BIGINT)),
       YEAR(SalesDate)
FROM dbo.SalesTrackingHistory
WHERE SalesDate
BETWEEN '01-01-2022' AND '12-31-2022'
GROUP BY YEAR(SalesDate);
Statistics IO segment skipped

You can see from the screenshot above that SQL skipped most of the segments. Mission accomplished.

You can use the handy query below to look at the metadata further. The bulk of the valuable information comes from the DMV column_store_segments.

-- This statement provides information about our segments.
SELECT t.[Name] AS TableName,
       i.[Name] AS IndexName,
       c.[Name] AS ColumnName,
       se.segment_id AS SegmentId,
       se.row_count AS SegmentRowCount,
       se.min_data_id AS MinRowValue,
       se.max_data_id AS MaxRowValue
FROM [sys].[column_store_segments] AS se
    INNER JOIN [sys].[partitions] AS p
        ON p.hobt_id = se.hobt_id
    INNER JOIN [sys].[indexes] AS i
        on i.OBJECT_ID = p.OBJECT_ID
           AND i.index_id = p.index_id
    INNER JOIN [sys].[index_columns] AS ic
        ON ic.OBJECT_ID = i.OBJECT_ID
      AND ic.index_id = i.index_id
      AND ic.index_column_id = se.column_id
    INNER JOIN [sys].[columns] AS c
        ON c.OBJECT_ID = p.OBJECT_ID
           AND c.column_id = ic.column_id
   INNER JOIN [sys].[tables] AS t
      ON t.object_id = i.object_id
WHERE c.[Name] = 'SalesDate';
GO

Limitations of Order

For me, the most significant limitation of the new ordered index is not being able to create one as a nonclustered. For transactional OLTP-type environments, you don't commonly use clustered columnstore indexes. When you oversee an environment with a hybrid transactional and analytical system, nonclustered columnstore indexes can help.

When you perform a rebuild, the operation needs to be offline. One of the nice features of SQL Server 2019 was the online rebuild of columnstore indexes, which means your queries didn't need to wait until the operation completes. Plus, Microsoft still recommends using MAXDOP of 1 when rebuilding the index, so it could take a while if your index has billions of rows. However, the wait time may be fine if you only refresh data once or twice daily.

There are limitations in SQL Server's ability to sort the columnstore index accurately. Since SQL Server uses Tempdb to perform the sort operation, if you need to spill to disk, SQL uses a soft sort. According to Edward Pollack, a soft sort will sort as many rows as possible, but when space runs out, it will stop. Said another way, there's potential for your ordered columnstore index not to be ordered.

Even Microsoft documentation highlights that loading data into an ordered index takes longer. The documentation states that loading data into an ordered CCI table can take longer than a non-ordered CCI table because of the data sorting operation; however, queries can run faster afterward with an ordered CCI.

Final Word

As you may have guessed, I doubt I'll use an ordered clustered columnstore index for most environments I work in anytime soon. If you work mainly with analytical systems or something like Azure Synapse, then having the ability to order your clustered columnstore indexes might prove invaluable.

There are currently several downsides to ordering. Microsoft even mentions the fact of bloating Tempdb when the rebuild happens. Their workaround is to scale up the pool and, after the rebuild is completed, scale down, which I guess is a workaround. Ultimately, I highly recommend testing and seeing if adding the ordered option benefits your environment. What about you? Are you currently using an ordered clustered columnstore index or plan to in the future?

Key Takeaways

  • SQL Server 2022 now offers the ability to order one or more columns when creating a clustered columnstore index.
  • The rebuild operation for an ordered clustered columnstore index is offline.
  • Microsoft suggests using MAXDOP = 1 when creating and rebuilding to increase the likelihood of segment alignments. This setting will likely increase the time the operation takes.
  • Unless your environment is mainly analytical, I do not believe creating an ordered clustered columnstore index will be beneficial.
Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Jared Westover Jared Westover is a passionate technology specialist at Crowe, helping to build data solutions. For the past two decades, he has focused on SQL Server, Azure, Power BI, and Oracle. Besides writing for MSSQLTips.com, he has published 12 Pluralsight courses about SQL Server and Reporting Services. Jared enjoys reading and spending time with his wife and three sons when he's not trying to make queries go faster.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2023-06-19

Comments For This Article

















get free sql tips
agree to terms