Be Careful with Key Order in SQL Server Missing Index Recommendations

Problem

I have been working with SQL Server for a long time, but I’m still constantly learning new things. Generally, its missing index recommendations can be useful at a high level, and some systems rely on them heavily. I haven’t spent a ton of time analyzing them because of the limitations in what they offer, such as often suggesting redundant indexes, never offering filtered indexes, and recommending indexes for queries that have run exactly once. And sometimes, these indexes are not created correctly, specifically in terms of the order of key columns. Not too long ago, I learned why.

Solution

It turns out that SQL Server just orders key columns based on their ordinal position in sys.columns. Let’s look at a simple example; we’ll create a table (with, on my system, about 85,000 rows):

CREATE TABLE dbo.DateFirst
(
  id int IDENTITY(1,1) NOT NULL,
  dt datetime NOT NULL,
  ln nvarchar(200) NOT NULL,
  CONSTRAINT pk_DateFirst PRIMARY KEY (id)
);
;WITH x AS
(  
  SELECT 
    dt = CONVERT(date,DATEADD(DAY,1-(o.[object_id]/10.0 * column_id) % 1000, modify_date)),
    ln = LEFT(o.name + c.name, 200)
  FROM sys.all_objects AS o
  INNER JOIN sys.all_columns AS c
  ON o.[object_id] BETWEEN c.[column_id] - 20 AND c.[column_id] + 20
)
INSERT dbo.DateFirst(dt,ln) 
SELECT dt = MAX(dt), ln FROM x GROUP BY ln;

If we run a simple query against this table, where we use equality predicates on both the datetime and string columns, we get the following index recommendation:

SELECT id FROM dbo.DateFirst
  WHERE dt = '2018-05-11'
    AND ln = N'sysallocunitsabort_state'
    AND 1 > (SELECT 0); -- to make optimization non-trivial
/*
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[DateFirst] ([dt],[ln])-- order of table
*/

You might conclude that it just listed them that way because of the order of the predicates. But if you reverse the order, putting the lncheck before dt, the exact same index is suggested:

SELECT id FROM dbo.DateFirst
  WHERE ln = N'sysallocunitsabort_state'
    AND dt = '2018-05-11'
    AND 1 > (SELECT 0); -- to make optimization non-trivial/*
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[DateFirst] ([dt],[ln])-- order of table
*/

It can actually get a little more complex than that, because the key columns will be divided into two sets: those that satisfy equality predicates, and those that satisfy inequality predicates. If we change one of the predicates to be inequality, we now get the key columns listed in the opposite order (but really, they are two groups of columns listed in order of equality and then inequality):

SELECT id FROM dbo.DateFirst
  WHERE dt <> '2018-05-11'
    AND ln = N'sysallocunitsabort_state'
    AND 1 > (SELECT 0);
/*
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[DateFirst] ([ln],[dt])-- NOT order of table  
*/

For fabricated data like this, it may not really matter. The point is that a recommended index in a real world scenario might be much better suited to having a specific leading key column, and it may not reflect the order of the columns in the table.

Let’s create a new version of the table with those columns flipped, and insert the same data:

CREATE TABLE dbo.DateLast
(
  id int IDENTITY(1,1) NOT NULL,
  ln nvarchar(200) NOT NULL,
  dt datetime NOT NULL,
  CONSTRAINT pk_DateLast PRIMARY KEY(id)
);
INSERT dbo.DateLast(ln,dt) SELECT ln,dt FROM dbo.DateFirst;

Now, we’ll run similar queries, and see the recommendations we get:

SELECT id FROM dbo.DateLast
  WHERE dt = '2018-05-11'
    AND ln = N'sysallocunitsabort_state'
    AND 1 > (SELECT 0);
/*
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[DateLast] ([ln],[dt])-- order of table
*/
SELECT id FROM dbo.DateLast
  WHERE ln = N'sysallocunitsabort_state'
    AND dt = '2018-05-11'
    AND 1 > (SELECT 0);
/*
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[DateLast] ([ln],[dt])-- order of table
*/
SELECT id FROM dbo.DateLast
  WHERE ln <> N'sysallocunitsabort_state'
    AND dt = '2018-05-11'
    AND 1 > (SELECT 0);
/*
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[DateFirst] ([dt],[ln])-- NOT order of table
*/

Again, when all predicates use equality against a column, they’re all treated equally, and so SQL Server falls back to sorting by ordinal position in the table itself. Only when we introduce inequality against a column does it treat that column differently. And it is easy to prove that if all columns use inequality, the order would again fall back to ordinal position within the table.

You would think that important factors like density and selectivity would come into play, but they don’t. If you were creating these indexes yourself, and you knew your data, you would look at the predicates and deduce that the leading key column should usually be the one that is most selective, since it eliminates the largest portion of the B-tree the quickest.

Credit goes to Brent Ozar and Bryan Rebok, who recently solved this mystery in this Stack Exchange question a question on Database Administrators Stack Exchange.

Summary

So, what is the fix? I don’t know. Heuristics are tough, and sometimes the correct index structure depends on future information that SQL Server doesn’t even know yet. All I can suggest is that you pay particular attention to the key column order when you are creating indexes that come from individual index recommendations and, more importantly, the Database Engine Tuning Advisor (DTA).

Next Steps

Read on for related tips and other resources:

Aaron Bertrand

Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He also blogs at sqlblog.org.

MSSQLTips Awards: Author of the Year – 2016, 2023 | Leadership (200+ tips) – 2022

Summary

Leave a ReplyCancel Reply