SQL Server Insert Performance for Clustered Indexes vs. Heap Tables

By:   |   Comments (7)   |   Related: > Indexing


Problem

I've read many SQL Server best practice documents that state every table should have a clustered index as opposed to leaving it as a heap structure with nonclustered indexes.  Most of the documentation notes the administrative benefits of using clustered indexes.  Are there any performance implications, either positive or negative, from implementing a clustered index on a table?

Solution

In order to test the performance of each of these types of table/index structures we will create two identical tables with one difference.   One will have a primary key created with a clustered index and the other will have the primary key created using a nonclustered index so the table data will stay in the heap structure.  Below is the T-SQL to create and load sample data into these tables.

-- Create table and indexes
CREATE TABLE testtable ([col1]  [int] NOT NULL PRIMARY KEY CLUSTERED,
                        [col2] [int]  NULL,
                        [col3] [int]  NULL,
                        [col4] [varchar](50) NULL,
                        [col5] uniqueidentifier); 
 
-- Load sample data into table
DECLARE @val INT
SELECT @val=1
WHILE @val < 5000000
BEGIN  
   INSERT INTO testtable (col1, col2, col3, col4, col5) 
       VALUES (@val,round(rand()*100000,0),
               round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid())
   SELECT @val=@val+1
END
GO
 
-- Create sample table and indexes
CREATE TABLE testtable2 ([col1]  [int] NOT NULL PRIMARY KEY NONCLUSTERED,
                         [col2] [int]  NULL,
                         [col3] [int]  NULL,
                         [col4] [varchar](50) NULL,
                         [col5] uniqueidentifier); 
 
INSERT INTO testtable2 SELECT * FROM testtable;
   

One thing I noticed right away was that the heap table was using more space since the table and index are separate structures.  With the clustered index the data is stored in the leaf pages of the index so less space is used.  Below is a chart that shows how much space is used by each object after the initial load.

TableName IndexName Used (KB) Reserved (KB) Rows
testtable PK__testtabl__357D0D3E3D086A66 257952 257992 4999999
testtable2 HEAP 256992 257032 4999999
testtable2 PK__testtabl__357D0D3F2CBA35D8 89432 89608 4999999

Now that we have some data loaded let's perform another data load and monitor the performance using SQL Profiler and check the space usage after the inserts complete.  Below is the T-SQL to load another 100,000 records each table.

-- insert when all pages are full
DECLARE @val INT
SELECT @val=5000000
WHILE @val < 5100000
BEGIN  
   INSERT INTO testtable (col1, col2, col3, col4, col5) 
       VALUES (@val,round(rand()*100000,0),
               round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid())
   SELECT @val=@val+1
END
GO
 
DECLARE @val INT
SELECT @val=5000000
WHILE @val < 5100000
BEGIN  
   INSERT INTO testtable2 (col1, col2, col3, col4, col5) 
       VALUES (@val,round(rand()*100000,0),
               round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid())
   SELECT @val=@val+1
END
GO
   

Looking first at the space usage as one would expect (since the pages were all full ie. nothing has been deleted from the table) each object is using proportionately more space. Both the heap and clustered index increased in size by almost the same amount.

TableName IndexName Used (KB) Reserved (KB) Rows
testtable PK__testtabl__357D0D3E3D086A66 263128 263176 5099999
testtable2 HEAP 262392 262472 5099999
testtable2 PK__testtabl__357D0D3F2CBA35D8 91216 91272 5099999

A little more interesting is the SQL Profiler output.  Due to the fact that there are two objects to be updated the nonclustered index and heap table insert requires slightly more CPU and performs more reads and writes than the clustered index insert.  It also takes slightly longer to complete.

Index Type CPU (ms) Reads Writes Duration (ms)
CLUSTERED 3500 304919 654 11288
HEAP 3890 406083 904 11438

Let's now randomly delete some data and then run another 100,000 record insert to see if having free space on some of the pages in the table makes any difference with performance.  Below is the T-SQL to perform both the delete and the data load.

-- remove 1000000 random records from each table
DELETE FROM testtable WHERE col1 in (SELECT TOP 1000000 col1 FROM testtable ORDER BY newid());
DELETE FROM testtable2 WHERE col1 not in (SELECT col1 FROM testtable);
GO
 
-- insert when there is free space in pages
DECLARE @val INT
SELECT @val=5100000
WHILE @val < 5200000
BEGIN  
   INSERT INTO testtable (col1, col2, col3, col4, col5) 
       VALUES (@val,round(rand()*100000,0),
               round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid())
   SELECT @val=@val+1
END
GO
 
DECLARE @val INT
SELECT @val=5100000
WHILE @val < 5200000
BEGIN  
   INSERT INTO testtable2 (col1, col2, col3, col4, col5) 
       VALUES (@val,round(rand()*100000,0),
               round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid())
   SELECT @val=@val+1
END
GO
   

Looking at the SQL Profiler output we see a much larger performance gap between the two types of structures this time with the table with the clustered index having much better performance.

Index Type CPU (ms) Reads Writes Duration (ms)
CLUSTERED 3562 304859 653 10334
HEAP 4973 422142 7053 13042

This difference is most likely due to the fact that when inserting data into a heap table the database engine will search for empty space within each page to store the new data being inserted.  This is done since the data in the heap table is unsorted so it can go anywhere.  The table with the clustered index simply has data appended to the end of the clustered index since the primary key is increasing.   Let's also take a look at the space usage for the table.  If the heap structure is indeed reusing space it should not have grown any more since we deleted 1,000,000 records and only inserted 100,000.  Below is the table sizes after this last batch of inserts completed.  We can see here that the size of the heap table has not changed confirming our earlier assumption.

TableName IndexName Used (KB) Reserved (KB) Rows
testtable PK__testtabl__357D0D3E3D086A66 268304 268360 4199999
testtable2 HEAP 262392 262472 4199999
testtable2 PK__testtabl__357D0D3F2CBA35D8 93008 93064 4199999 

Summary

In this test case we showed that using a clustered index gave us better performance in every category that we measured, CPU, IO and total duration. The only side effect of using a clustered index was that it did use a little bit more space.  This space could also be reclaimed by rebuilding the index during a maintenance window (or online if you are running Enterprise Edition).  In any case, before implementing any index changes in production you should always test in your own environment to ensure you also see a positive improvement in performance.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Ben Snaidero Ben Snaidero has been a SQL Server and Oracle DBA for over 10 years and focuses on performance tuning.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Friday, November 10, 2017 - 3:00:36 AM - The New Guy Back To Top (69506)

 

 Thanks for this post, very helpful to improve my future code!

Cheers


Monday, November 6, 2017 - 10:33:52 PM - Bill Back To Top (69335)

What you are demonstrating is the overhead of an index, not clustered vs heap.  When you created the primary key on a heap, you mearly created an index but when you created a clustered primary key then you just have the table.  To be more accurate, you should add a non clustered index to the clustered table so that both tables have an index.

Anyways, this test still would not be valid because you have to configure tables and indexes according to your data and usage patterns.  We don't always run off the clustered index or primary key, so in that case we use a non clustered index.  When possible we include all needed columns in the index so the table is not even accessed making it mute if it were clustered or not, and even if we do not include all columns then it takes more work to look up a record using a clustered index than an RID.  But again, if your non clustered index happens to pull data that is within ranges of the clustered index then it is more likely a lot of data resides in the same pages thus giving you fewer reads.  Your choice of heap of not is determined by getting the best performance, and that can be by heap or clustered but not always nessesarily clustered.


Monday, October 23, 2017 - 5:10:23 PM - Charles N Burns Back To Top (68706)

 This was interesting -- I always like to see real performance data -- but has a number of issues and greatly oversimplifies the matter. For example, index lookups for non-clustered columns suffer an extra degree of indirection, affecting performance for any query that filters on any row except the clustered index.

Clustered tables are also less efficient for lookups if the clustered key is not chosen very, very carefully. Are most table lookups based on the clustered key? If so, it can provide a significant performance boost. If not, performance drops (due to the indirection).

The problem with the above is that tables are often queried based on a wide variety of column combinations, so it is impossible to choose a good clustered key.

 See, among other sources: http://use-the-index-luke.com/sql/clustering/index-organized-clustered-index


Friday, July 7, 2017 - 11:03:34 AM - Chris Back To Top (59018)

Hello,

Thank you for this article. It is definitely a good discussion starting point, but I agree more with the other commenters. As Uwe pointed out, you're inserting into two structures for the heap and only one for the clustered index. And, as sheen pointed out, you're not necessarily grabbing and inserting into the middle of those structures when you're performing your inserts and deletes.

The other thing I noticed is that you're also assuming that your primary key is an increasing integer. What if you have a composite primary key based off of birthdate and zip code? I know that a solid primary key will be incremental, but that is not always the case. I have seen plenty of code that used non-sequential GUIDs as the primary key. That would almost certainly slow things down for the clustered index inserts, as it has to make the comparisons and then insert new data pages or move data to other pages as the new data that falls between ranges is added.  There are plenty of online discussions regarding the creation of primary and foreign keys, so I don't want to get into that here, but you must also realize that not every clustered index will be in a nice, tidy, incremental format.

I appreciate the work you've done here for this article. Thank you for sharing.


Friday, July 7, 2017 - 10:07:24 AM - Ben Snaidero Back To Top (59013)

 @sheen

I would assume performance would be similar to the delete/insert test case I did above where I randomly deleted data since for the heap it would still search for free space and then update the index accordingly and the clustered index delete/insert would be exactly the same as my case.  You could test this theory yourself using the setup script from the tip to confirm.

Thanks for reading


Friday, July 7, 2017 - 8:58:20 AM - sheen Back To Top (59010)

 Hi,

Thanks for doing this comparision.

What is the performance result if you try to delete and insert some data in the middle of the range (like < 5mil) ? Don't delete and insert the same data.

 


Friday, July 7, 2017 - 8:15:28 AM - Uwe Ricken Back To Top (59006)

 

 Hi Ben,

interesting article but to be honest - you are dealing with different scenarios. To have a really comparision between CI and HEAP you shouldn't have an additonal index on the heap. :)

The CI will have only ONE insert but the HEAP has to deal with the insert in the table AND the NCI.

From my experience a heap can be so much better than a CI if you distribute the table on different files in a file group. The real hotspot in a HEAP is the PFS.















get free sql tips
agree to terms