SQL Server 2016 ColumnStore Index String Predicate Pushdown

By: Ahmad Yaseen | Comments | Related: > Indexing

Problem

When designing a data warehouse, it is recommended to use numeric or integer keys to filter records or in the join operator predicates, rather than using string columns. Applying these predicates on string values is very expensive and will significantly slow down your analytical queries, especially when querying large tables in a data warehouse. However, most of the predicates on data warehouses are performed using string values and Microsoft is aware of that during their customer visits. SQL Server 2016 comes with significant enhancements to the Columnstore index functionality and performance. One of the improvements to the Columnstore index is the String Predicate Pushdown. What is this technology and how will it simplify using string values in predicates?

Solution

The SQL Server Columnstore index compression strategy depends mainly on the Dictionaries concept for string datatypes. There are two types of the dictionaries. The Local Dictionary that serves mainly one single segment of a specific column, and the Global Dictionary that serves all segments available for a specific column. The Dictionary stores the full column value, where the Segment stores a reference to that dictionary entry. For the repeated column values, one copy of these values will be stored in the dictionary, but referenced by the segment multiple times. To check if a specific row fulfills the search criteria, the string comparison will be performed at the dictionary level within the Columnstore index, comparing only one value instead of comparing each value separately.

Starting from SQL Server 2016, the Columnstore index string predicates can be pushed down to the SCAN node, minimizing the string comparisons and improving query performance significantly. In the string predicate pushdown technique, the string predicate is performed against the values in the dictionary, which contain no duplicates, reducing the number of rows that will be involved in the comparison process, and the number of records that will be flown out of the SCAN node. For the dictionary values that meet the predicate, all rows referring to that dictionary value will be returned by the query.

In order to understand the String Predicate Pushdown technique, let us do a practical demo. We will create two new tables under the MSSQLTipsDemo testing database, the Employee_Department table and the Employees table that has a clustered Columnstore index and EmpDepID column as a foreign key from the Employee_Department table. The below script is used to create the two new tables:

USE [MSSQLTipsDemo]
GO

CREATE TABLE [dbo].[Employee_Department](
   [DepID] [int] NOT NULL,
   [DepName] [varchar](50) NOT NULL,
PRIMARY KEY CLUSTERED 
(
   [DepID] ASC
)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Employees](
   [EmpID] [int] NOT NULL,
   [EmpName] [varchar](50) NOT NULL,
   [EmpAddress] [varchar](50) NOT NULL,
   [EmpDEPID] [int] NOT NULL,
   [EmpBirthDay] [datetime] NULL,
INDEX Employees_CCI CLUSTERED COLUMNSTORE) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Employees]  WITH CHECK ADD  CONSTRAINT [FK_Employees_EmpDep] FOREIGN KEY([EmpDEPID])
REFERENCES [dbo].[Employee_Department] ([DepID])
GO

ALTER TABLE [dbo].[Employees] CHECK CONSTRAINT [FK_Employees_EmpDep]
GO

Once the two tables are created successfully, we will fill each table with 10,000 records using the following insert statements:

USE [MSSQLTipsDemo]
GO
INSERT [dbo].[Employee_Department] ([DepID], [DepName]) VALUES (294747, N'Production Control')
GO 2000
INSERT [dbo].[Employee_Department] ([DepID], [DepName]) VALUES (309152, N'Executive')
GO 2000
INSERT [dbo].[Employee_Department] ([DepID], [DepName]) VALUES (830332, N'Tool Design')
GO 2000
INSERT [dbo].[Employee_Department] ([DepID], [DepName]) VALUES (880375, N'Human Resources')
GO 2000
INSERT [dbo].[Employee_Department] ([DepID], [DepName]) VALUES (881776, N'Tool Design')
GO 2000

INSERT [dbo].[Employees] ([EmpID], [EmpName], [EmpAddress], [EmpDEPID], [EmpBirthDay]) VALUES (1592629756, N'Violet', N'199 Court Street', 294747, CAST(N'1878-10-09T10:20:29.000' AS DateTime))
GO 2000
INSERT [dbo].[Employees] ([EmpID], [EmpName], [EmpAddress], [EmpDEPID], [EmpBirthDay]) VALUES (2061234722, N'Victoria', N'48 School Street', 309152, CAST(N'1766-12-21T05:51:49.000' AS DateTime))
GO 2000
INSERT [dbo].[Employees] ([EmpID], [EmpName], [EmpAddress], [EmpDEPID], [EmpBirthDay]) VALUES (1031404061, N'Victoria', N'23 Williams Street', 830332, CAST(N'1869-08-13T16:55:28.000' AS DateTime))
GO 2000
INSERT [dbo].[Employees] ([EmpID], [EmpName], [EmpAddress], [EmpDEPID], [EmpBirthDay]) VALUES (2014081749, N'Edward', N'23 Williams Street', 880375, CAST(N'1821-07-21T11:24:26.000' AS DateTime))
GO 2000
INSERT [dbo].[Employees] ([EmpID], [EmpName], [EmpAddress], [EmpDEPID], [EmpBirthDay]) VALUES (659745901, N'Roberto', N'208 Pine Street', 881776, CAST(N'1755-11-26T06:08:48.000' AS DateTime))
GO 2000

Using SQL Server 2014

Now the tables are ready to start the test scenarios. Assume that we have a SQL Server 2014 instance, or the database compatibility level is 120, indicating that the database is using SQL Server 2014, that can be achieved by executing the ALTER DATABASE statement below:

USE [master]
GO

ALTER DATABASE [MSSQLTipsDemo] SET COMPATIBILITY_LEVEL = 120
GO

After that, we will run the below SELECT statement after enabling the execution plan and time statistics to check the query performance:

SELECT COUNT (EMP.[EmpID]) 
     ,EMP.[EmpName]
     ,EMP.[EmpAddress]
     ,EMP.[EmpBirthDay]
     ,EMPD.[DepName]
FROM Employees EMP
JOIN Employee_Department EMPD
ON   EMP.[EmpDEPID]=EMPD.DepID
WHERE  EMP.[EmpName]  LIKE '%John%'
GROUP BY EMP.[EmpName],EMP.[EmpAddress],EMP.[EmpBirthDay] ,EMPD.[DepName]
OPTION (RECOMPILE)

The execution plan below shows that a scan operation is performed on the Columnstore index, scanning all the Employees table rows, then the data will be passed to the Filter node and finally sorted in the Sort node:

This is also clear from the Columnstore Index Scan operator properties:

The columnstore index is executed in Row execution mode.
All the table's records, 10,000 records, are scanned in the Columnstore Index Scan operator and passed to the Filter node.
The Columnstore Index Scan operator cost is 32% of the overall query cost.
The string predicate is performed at the Filter node.

The query took 181ms to execute and consumed 15ms from the CPU time as you can see from the time statistics below:

Using SQL Server 2016

Let us perform the same scenario using a SQL Server 2016 instance. We need to make sure that the database compatibility level is 130, indicating that the database is using SQL Server 2016 that can be achieved by running the ALTER DATABASE statement below:

USE [master]
GO

ALTER DATABASE [MSSQLTipsDemo] SET COMPATIBILITY_LEVEL = 130
GO

Then we will run the same previous SELECT statement as follows after enabling the execution plan and time statistics to check the query performance:

SELECT COUNT (EMP.[EmpID]) 
     ,EMP.[EmpName]
     ,EMP.[EmpAddress]
     ,EMP.[EmpBirthDay]
     ,EMPD.[DepName]
FROM Employees EMP
JOIN Employee_Department EMPD
ON   EMP.[EmpDEPID]=EMPD.DepID
WHERE  EMP.[EmpName]  LIKE '%John%'
GROUP BY EMP.[EmpName],EMP.[EmpAddress],EMP.[EmpBirthDay] ,EMPD.[DepName]
OPTION (RECOMPILE)

Checking the generated execution plan, you will see that the Filter node is no longer there, and the data is passed directly from the Columnstore Index Scan node to the join operator as follows:

It is clear from the Columnstore Index Scan operator properties that:

The columnstore index is executed in Batch execution mode.
Only 64 records, out of the 10,000 records in the Employee table, are scanned in the Columnstore Index Scan operator and passed to the Join operator directly.
The string predicate that appears in the WHERE clause and logically takes place after the join matching process is pushed down to the Columnstore Index Scan operator that is responsible for retrieving the rows from the table and flows to the Join operator.
The filtering occurs earlier within the query execution in the same SCAN node, which is clear from the absence of the Filter node.
The Columnstore Index Scan operator cost is 21% of the overall query cost.

In addition, the query took 68ms to execute in SQL Server 2016 with the String Predicate Pushdown technique, compared to 181ms in SQL Server 2014, consuming no CPU time as compared to the 15ms CPU time consumed in SQL Server 2014 as you can see from the time statistics below:

Summary

From the previous results and statistics, you can see that the new SQL Server 2016 Columnstore Index String Predicate Pushdown technique reduces the number of records to be scanned by the SCAN node, which is 0.0064 of the table records in SQL Server 2016 compared to the full table records scan performed in SQL Server 2014.

Also, rather than performing the string predicate in a separate Filter node by flowing all the table records to be filtered in SQL Server 2014, the string predicate will be pushed down to be performed in the same Columnstore Index Scan operator, simplifying and speeding up the predicating process. When speaking about speeding up the query, the time statistics shows that the time required to execute the query with the String Predicate Pushdown technique is 0.38 of the time required to execute it with no push down in SQL Server 2014. Based on these results, consider the benefits from the SQL Server 2016 Columnstore Index String Predicate Pushdown and all other new features and enhancements as you select the platform for your next data warehouse.

Next Steps

Check SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables
Check also SQL Server Column Store Index Performance
Read more about SQL Server Columnstore, B-Tree and Hybrid Index Performance Comparison for Data Warehouses
Read Also Identify the best tables for SQL Server 2016 Columnstore Index Migration
Enjoy reading Columnstore Index Recommendations using SQL Server 2016 Upgrade Advisor
Enjoy also reading SQL Server 2012 Column Store Index Example

About the author

Ahmad Yaseen is a SQL Server DBA with a bachelor’s degree in computer engineering as well as .NET development experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips