Do not rely solely on a SQL Server Clustered Index Seek when Optimizing a Query

By: Mehdi Ghapanvari   |   Updated: 2024-02-07


Problem

Suppose you have a list of user IDs and need to retrieve the matching rows from a table whose clustered primary key is on the ID column. Seeing a clustered index seek in the execution plan does not by itself guarantee good performance, particularly under high concurrency. Even when SQL Server performs a clustered index seek and the estimated number of rows matches the actual number of rows, the query can still be slow for other reasons.

Solution

Clustered indexes sort and store the rows of a table or view based on their key values. Because the rows themselves can be stored in only one order, a table can have only one clustered index, whereas it can have many non-clustered indexes. When a clustered index is created on a table, the data rows are kept in the order of the clustered index key, which can benefit certain types of queries.

A clustered index enables efficient data retrieval, but it should be chosen and implemented carefully, considering the specific requirements and usage patterns of the table. A clustered index seek is the operation the database engine uses to locate and retrieve rows by navigating the clustered index. When a query specifies a condition that matches the indexed columns, a clustered index seek is an efficient way to find the desired data. However, a clustered index seek is not the optimal choice in every scenario.

Set Up Test Environment

For this demonstration, I will use SQL Server 2022 and a copy of the StackOverflow database. Its Users table holds information about users and has a clustered primary key on the Id column, which is an identity column with an integer data type. The Id column is ever-increasing, unique, narrow, and static, which makes it suitable as a clustered index key. It is also the column used to join the Users table to other tables, such as the Posts table.

Suppose we have a list of user IDs and want to construct a query that returns their detailed information. We can create a temporary table and populate it with the desired user IDs for simulation purposes.

Use master
GO

Alter Database StackOverflow Set Compatibility_Level = 160
GO

Use StackOverflow
GO

Create Table #Ids
(Id Int Not Null)
GO

Insert Into #Ids
Select Top 100000 Id From dbo.Users Where Id % 7 = 1
GO

Having created a list of user IDs, let's write a query to show comprehensive information about the users. To get IO statistics, use the command below:

SET STATISTICS IO ON
GO

To include the actual execution plan in SSMS, press Ctrl + M before running the query.

Select u.* From dbo.Users u
Inner Join #Ids i
 On u.Id = i.Id

The image below shows that SQL Server has performed a clustered index seek operation. As you can see, the estimated number of rows is the same as the actual number of rows for all the operations in the actual execution plan.

[Image: Clustered Index Seek]

Everything seems fine. However, the next screenshot shows that the number of logical reads is over three hundred thousand.

[Image: Clustered Index Seek logical reads]

How many pages are there in the users table? The query and the image below show that the users table has about ninety thousand pages.

SELECT OBJECT_NAME(s.object_id) AS [TableName],
       s.row_count AS [RowCount],
       s.used_page_count AS [UsedPages],
       s.reserved_page_count AS [ReservedPages]
FROM sys.dm_db_partition_stats s
    JOIN sys.tables t
        ON s.object_id = t.object_id
WHERE OBJECT_NAME(s.object_id) = 'Users';
GO

[Image: the number of pages in the Users table]
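
The query above returns one row for each index (and partition) on the table. As a variation, the following is a minimal sketch that restricts the result to the clustered index, assuming only the pages of the table itself are of interest. It relies on the fact that index_id = 1 identifies a clustered index; the column aliases are illustrative.

```sql
-- Sketch: page counts for the clustered index of dbo.Users only.
-- index_id = 1 is the clustered index; SUM() folds multiple partitions into one row.
SELECT OBJECT_NAME(s.object_id)      AS [TableName],
       SUM(s.row_count)              AS [RowCount],
       SUM(s.in_row_data_page_count) AS [InRowDataPages],  -- leaf-level data pages
       SUM(s.used_page_count)        AS [UsedPages]        -- includes index and LOB pages
FROM sys.dm_db_partition_stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Users')
  AND s.index_id = 1
GROUP BY s.object_id;
GO
```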

So why did SQL Server read more than three hundred thousand pages to execute the query? Look at the NESTED LOOP JOIN operator shown in the image below.

[Image: nested loop join operator]

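A NESTED LOOP JOIN is the simplest JOIN algorithm. Like all JOIN types, it takes two inputs, known as the outer and inner tables. It is executed as a nested loop: the inner table is searched once for each qualifying row in the outer table. Here the outer input is #Ids, which holds up to 100,000 rows, so SQL Server performs a separate seek into the clustered index for each of them, and every seek reads one page per index level on its way from the root page to the leaf page. Roughly three hundred thousand logical reads for 100,000 seeks is consistent with an index about three levels deep. Each individual seek is cheap, but the total can exceed the cost of reading the entire table once, and performing a few hundred thousand reads in an I/O-bound system can have a significant negative impact on performance.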
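
One way to see why repeated seeks add up is to check the depth of the clustered index, since a seek has to walk from the root page down to a leaf page. The following is a minimal sketch using sys.dm_db_index_physical_stats. It assumes the clustered index on dbo.Users has index_id 1 (true for any clustered index) and filters to in-row data so that LOB allocation units are not listed.

```sql
-- Sketch: inspect the depth and leaf page count of the clustered index on dbo.Users.
-- 'LIMITED' mode avoids reading the leaf pages themselves, so it is relatively cheap.
SELECT ips.index_depth,   -- number of levels from root to leaf
       ips.page_count     -- leaf-level pages in this allocation unit
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Users'), 1, NULL, 'LIMITED') AS ips
WHERE ips.alloc_unit_type_desc = 'IN_ROW_DATA';
GO
```

If index_depth comes back as 3, for example, then 100,000 seeks would cost on the order of 300,000 logical reads, regardless of how few pages the matching rows actually occupy.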

Let's rewrite the query:

Create Table #MinMaxIDs
(MinId Int Not Null, MaxId Int Not Null)
GO

Insert Into #MinMaxIDs
Select Min(Id) AS MinId, Max(Id) As MaxId From #Ids
GO

Select u.* From dbo.Users u
Inner Join #Ids i
 On u.Id = i.Id
Inner Join #MinMaxIDs mmi
 On u.Id Between mmi.MinId And mmi.MaxId
GO

As depicted in the image below, the number of logical reads has been reduced to approximately eleven thousand. That's great.

[Image: Clustered Index Seek logical reads for the rewritten query]

The image below illustrates the actual execution plan of the query.

[Image: Hash join operator]

Here is a zoomed-in view of the previous image, focusing on the hash match operator.

[Image: Hash Join]

A hash join reads each input only once. First, it reads one input (the build input) to create hash buckets. Then, it reads the other input (the probe input) and checks the hash buckets for matching rows. CPU resources are required to calculate the hash function and maintain the contents of the buckets, and memory is needed to store the hashed data. However, if your system is I/O bound, you may have spare memory and CPU. In such scenarios, a hash join can be a viable option, particularly if your I/O resources are limited and you need to join large tables.
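
To compare the two join algorithms directly on the original query, one option is to force each of them with a query hint and look at the logical reads reported for each run. This is a sketch for experimentation only, assuming the #Ids table created earlier still exists in the session; join hints override the optimizer and are generally not something to leave in production code.

```sql
-- Sketch: run the same join twice, forcing a different physical join each time,
-- and compare the logical reads shown on the Messages tab.
SET STATISTICS IO ON;
GO

-- Force a nested loops join
SELECT u.*
FROM dbo.Users AS u
INNER JOIN #Ids AS i
    ON u.Id = i.Id
OPTION (LOOP JOIN);
GO

-- Force a hash join
SELECT u.*
FROM dbo.Users AS u
INNER JOIN #Ids AS i
    ON u.Id = i.Id
OPTION (HASH JOIN);
GO
```

Which variant comes out ahead depends on the data, the available memory, and the hardware, so the numbers should be measured rather than assumed.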

Summary

Several metrics can be used to measure the performance of a query – one being the number of logical reads. A logical read occurs each time the database engine requests a page from the buffer cache. Optimizing queries to reduce the number of logical reads can improve performance. No plan operator is always bad. A NESTED LOOP operation requires multiple searches of the inner table. A HASH JOIN, on the other hand, does not do multiple reads of either table. The superiority of one approach over another depends on the specific characteristics and requirements of the query at hand.
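
Because logical reads are only one of several metrics, it can help to capture CPU and elapsed time alongside them. The following is a minimal sketch that turns on both STATISTICS IO and STATISTICS TIME for a test run, again assuming the #Ids table from the setup section is still available in the session.

```sql
-- Sketch: report logical reads plus CPU and elapsed time for the same query.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

SELECT u.*
FROM dbo.Users AS u
INNER JOIN #Ids AS i
    ON u.Id = i.Id;
GO

-- Switch the extra output off again once the measurement is done.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```

Running each candidate query several times and comparing reads, CPU time, and elapsed time together gives a fuller picture than any single number.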


About the author
Mehdi Ghapanvari is an SQL Server database administrator with 6+ years of experience. His main area of expertise is improving database performance. He is skilled at managing high-performance servers that can handle several terabytes of data and hundreds of queries per second, ensuring their availability and reliability.

This author pledges the content of this article is based on professional experience and not AI generated.


Comments For This Article




Monday, February 12, 2024 - 11:16:53 AM - Mehdi Ghapanvari
This article aims to reduce the number of logical reads. The number of logical reads is 300,000 for the first query and only 11,000 for the second query. The article's title may not be appropriate. I agree. The original title 'Do not rely solely on the clustered index seek' has been changed and is now updated to better reflect the article.

Saturday, February 10, 2024 - 11:01:02 AM - Erik Darling
I don't get it. The second query is only ~200ms faster (1.329ms vs 1.124ms), and there's still a clustered index seek in the query plan.