By: Mehdi Ghapanvari | Updated: 2024-02-07

Problem

Suppose you have a list of user IDs and need to retrieve the matching rows from a table with a clustered primary key on the ID column. A clustered index seek alone does not guarantee good performance here, particularly under high concurrency. Even when SQL Server performs a clustered index seek and the estimated number of rows matches the actual number of rows, the query can still be slow.

Solution

Clustered indexes sort and store the data rows of a table or view based on their key values. A table can have any number of non-clustered indexes but only one clustered index, because the data rows themselves can be stored in only one order. When a clustered index is created on a table, the data rows are stored in the order of the clustered index key, which can benefit certain types of queries.

A clustered index enables efficient data retrieval, but it should be chosen carefully, considering the specific requirements and usage patterns of the table. A clustered index seek is an operation performed by the database engine to locate and retrieve rows through the clustered index. When a query specifies a condition on the indexed columns, a seek is an efficient way to find a single row or a contiguous range of rows. However, a plan built on clustered index seeks is not the cheapest option in every scenario.

Set Up Test Environment

For this demonstration, I will use SQL Server 2022 and the StackOverflow database. It has a table called Users, which holds information about users. The Users table has a clustered primary key on the Id column, which is an identity column with an integer data type. The Id column is incremental, unique, narrow, and static, making it suitable as a clustered index key. This is particularly useful for joining the Users table to other tables, such as the Posts table.

Suppose we have a list of user IDs and want to construct a query that returns their detailed information. We can create a temporary table and populate it with the desired user IDs for simulation purposes.

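Use master
GO
-- Compatibility level 160 corresponds to SQL Server 2022
Alter Database StackOverflow Set Compatibility_Level = 160
GO
Use StackOverflow
GO
-- Temporary table that simulates the incoming list of user IDs
Create Table #Ids (Id Int Not Null)
GO
-- Top without Order By: which 100,000 matching IDs are returned is not guaranteed
Insert Into #Ids
Select Top 100000 Id From dbo.Users Where Id % 7 = 1
GO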

Having created a list of user IDs, let's write a query to show comprehensive information about the users. To get IO statistics, use the command below:

SET STATISTICS IO ON
GO

To include the actual execution plan in SQL Server Management Studio, press Ctrl + M before running the query.

Select u.* From dbo.Users u Inner Join #Ids i On u.Id = i.Id

The image below shows that SQL Server has performed a clustered index seek operation. As you can see, the estimated number of rows is the same as the actual number of rows for all the operations in the actual execution plan.

Everything seems fine. However, the next screenshot shows that the number of logical reads is over three hundred thousand.

How many pages are there in the users table? The query and the image below show that the users table has about ninety thousand pages.

SELECT OBJECT_NAME(s.object_id) AS [TableName],
       s.row_count AS [RowCount],
       s.used_page_count AS [UsedPages],
       s.reserved_page_count AS [ReservedPages]
FROM sys.dm_db_partition_stats s
JOIN sys.tables t ON s.object_id = t.object_id
WHERE OBJECT_NAME(s.object_id) = 'Users';
GO

So why did SQL Server read more than three hundred thousand pages to execute the query? Look at the NESTED LOOP JOIN operator shown in the image below.

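A NESTED LOOP JOIN is the simplest JOIN algorithm, which, like all JOIN types, requires two inputs known as the outer and inner tables. It is executed as a nested loop, where the inner table is searched multiple times, once for each qualifying row in the outer table. Here, #Ids is the outer table with 100,000 rows, so SQL Server performs 100,000 separate seeks into the clustered index of the users table. Each seek walks the index from the root page down to a leaf page and costs roughly one logical read per index level, so a total of just over three hundred thousand reads is consistent with an index about three levels deep. That is more than three times the number of pages in the whole table. Performing a few hundred thousand reads in an I/O bound system can have a significant negative impact on performance.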
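The read count can be sanity-checked against the depth of the clustered index, since each seek touches roughly one page per index level. The query below is a minimal sketch and not part of the original demonstration. It assumes the session is in the StackOverflow database and relies on the clustered index having index_id 1.

```sql
-- Sketch: report the depth and leaf page count of the clustered index on dbo.Users.
-- Assumption: the current database is StackOverflow.
SELECT ips.index_depth,   -- number of B-tree levels from root to leaf
       ips.page_count     -- pages at the leaf level (LIMITED mode reports the leaf level only)
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                    -- current database
         OBJECT_ID(N'dbo.Users'),    -- table to inspect
         1,                          -- index_id 1 = clustered index
         NULL,                       -- all partitions
         'LIMITED') AS ips           -- cheapest scan mode
WHERE ips.alloc_unit_type_desc = N'IN_ROW_DATA';
```

If index_depth comes back as 3, then 100,000 seeks cost about 300,000 logical reads, which would line up with the figure reported by STATISTICS IO.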

Let's rewrite the query:

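-- One-row table holding the lowest and highest requested Id
Create Table #MinMaxIDs (MinId Int Not Null, MaxId Int Not Null)
GO
Insert Into #MinMaxIDs
Select Min(Id) As MinId, Max(Id) As MaxId From #Ids
GO
-- The range join does not change the result; it gives the optimizer a range predicate on dbo.Users
Select u.*
From dbo.Users u
Inner Join #Ids i On u.Id = i.Id
Inner Join #MinMaxIDs mmi On u.Id Between mmi.MinId And mmi.MaxId
GO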

As depicted in the image below, the number of logical reads has been reduced to approximately eleven thousand. That's great.

The image below illustrates the actual execution plan of the query.

Here is a zoomed-in view of the previous image, focusing on the hash match operator.

A hash join reads each input only once. First, it reads the build input (the outer table, ideally the smaller one) and creates hash buckets from its join keys. Then, it reads the probe input (the inner table) and checks the hash buckets for matching rows. CPU resources are required to calculate the hash function and maintain the contents of the buckets, and memory is needed to store the hashed data; if the memory grant is too small, the join spills to tempdb. However, when your system is I/O bound, you may have memory and CPU to spare. In such scenarios, a hash join can be a good trade, particularly when you need to join large tables: it spends CPU and memory to avoid repeated reads.
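To see the trade-off directly, the same join can be run once with each physical join type forced through a query hint. This is a sketch for experimentation only. It assumes the #Ids table from the setup still exists in the session and that STATISTICS IO is on. The hints are not part of the original rewrite and are generally best kept out of production code.

```sql
-- Nested Loops forced: typically one clustered index seek per row in #Ids.
SELECT u.*
FROM dbo.Users AS u
INNER JOIN #Ids AS i ON u.Id = i.Id
OPTION (LOOP JOIN);
GO

-- Hash Match forced: each input is read once, at the cost of CPU and a memory grant.
SELECT u.*
FROM dbo.Users AS u
INNER JOIN #Ids AS i ON u.Id = i.Id
OPTION (HASH JOIN);
GO
```

Comparing the logical reads reported on the Messages tab for the two statements shows what each algorithm costs on your own data. Without the range predicate from #MinMaxIDs, the hash join version may have to scan the entire clustered index, so its read count can be higher than in the rewritten query.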

Summary

Several metrics can be used to measure the performance of a query – one being the number of logical reads. A logical read occurs each time the database engine requests a page from the buffer cache. Optimizing queries to reduce the number of logical reads can improve performance. No plan operator is always bad. A NESTED LOOP operation requires multiple searches of the inner table. A HASH JOIN, on the other hand, does not do multiple reads of either table. The superiority of one approach over another depends on the specific characteristics and requirements of the query at hand.
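Logical reads can also be reviewed after the fact from the plan cache instead of per session. The following is a minimal sketch, assuming your login has permission to query server-level dynamic management views; it is an illustration, not part of the original demonstration.

```sql
-- Sketch: top cached statements by average logical reads per execution.
SELECT TOP (10)
       qs.execution_count,
       qs.total_logical_reads,
       qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
       st.text AS batch_text          -- text of the batch that contains the statement
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_logical_reads DESC;
```

The view only covers plans that are currently cached, so the numbers disappear when a plan is evicted or the instance restarts.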

