SQL Server Performance Comparison INT versus GUID

Problem

Over the years I have read quite a few articles about which datatypes should be used for primary keys.  At some point the conversation ends up with one side thinking an integer is a better option and the other side leaning more towards using a GUID. 

In this tip we will aim to identify any performance differences between these two datatypes.  We will also look at a couple other index characteristics associated with each type to see if there are any other benefits that might outweigh any performance differences. 

Solution

As many of my tips deal with SQL Server performance we will test the performance of these two datatypes by running through a simple test case to see if we can measure any significant performance differences.  We will setup two almost identical tables with the only variance being the datatype of the primary key column. 

Below is the T-SQL to setup and load these tables.

 CREATE TABLE testtable_int ([col1] [int] NOT NULL primary key clustered,
                            [col2] [int] NULL,
                            [col3] [int] NULL,
                            [col4] [varchar](50) NULL); 
CREATE TABLE testtable_guid ([col1] [uniqueidentifier] NOT NULL primary key clustered,
                             [col2] [int] NULL,
                             [col3] [int] NULL,
                             [col4] [varchar](50) NULL); 
DECLARE @val INT
SELECT @val=1
WHILE @val < 5000000
BEGIN  
   INSERT INTO testtable_int (col1, col2,  col3, col4) 
     VALUES (@val,round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(@val AS VARCHAR))
   INSERT INTO testtable_guid (col1, col2,  col3, col4) 
     VALUES (newid(),round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(@val AS VARCHAR))
   SELECT @val=@val+1
END
GO 

Before we take a look at the performance let’s first take a look at some of the other index characteristics, disk space usage and fragmentation.  Given that an integer requires 4 bytes and a GUID uses 16 bytes I would expect that the index on the GUID column would be much larger.  Using the query below let’s take a look at just how much more disk space it uses.

 SELECT OBJECT_NAME(i.[object_id]) AS TableName,
       i.[name] AS IndexName,
       SUM(s.[used_page_count]) * 8 AS IndexSizeKB
FROM sys.dm_db_partition_stats AS s
INNER JOIN sys.indexes AS i ON s.[object_id] = i.[object_id] AND s.[index_id] = i.[index_id]
WHERE OBJECT_NAME(i.[object_id]) like '%testtable%'
GROUP BY i.[name],i.[object_id];

TableNameIndexNameIndexSizeKB
testtable_intPK__testtabl__357D0D3E194A02AE178056
testtable_guidPK__testtabl__357D0D3E65B8D323342456

As would be expected the index on the GUID column is using quite a bit more space.  But one thing we should consider when looking at this is whether all this extra space used is due to the fact that it’s a 4 byte datatype vs 16 byte datatype or if there is there some fragmentation that is contributing to the extra disk space being used.  I would suspect that since the GUID values are random there would be quite a bit more fragmentation with this datatype versus the integer which is ever increasing.  Let’s take a look at the fragmentation of each index using the T-SQL below.

 SELECT OBJECT_NAME(ind.OBJECT_ID) AS TableName, 
       ind.name AS IndexName, 
       indexstats.index_type_desc AS IndexType, 
       indexstats.avg_fragmentation_in_percent 
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL) indexstats 
INNER JOIN sys.indexes ind ON ind.object_id = indexstats.object_id AND ind.index_id = indexstats.index_id 
WHERE OBJECT_NAME(ind.OBJECT_ID) like '%testtable%'
ORDER BY indexstats.avg_fragmentation_in_percent DESC 

TableNameIndexNameIndexTypeAvg_fragmentation_in_percent
testtable_intPK__testtabl__357D0D3E194A02AECLUSTERED INDEX0.37432913904298  
testtable_guidPK__testtabl__357D0D3E65B8D323CLUSTERED INDEX99.1926777751701

The above output confirms our assumption, the index on the GUID column does have quite a bit more fragmentation.  Before we fix this let’s first run through some simple DML statements to see if there are any performance differences with the indexes in this state.  As always we can use SQL  Profiler to capture the performance metrics for each statement.  For my test I ran these queries 10 times restoring a baseline backup of the original tables each time in order to rule out any anomalies.  I also ran a DBCC DROPCLEANBUFFERS before each execution to simulate the worst case scenario of the data having to be read from disk.  Note that if you use the T-SQL below you’ll need to update the GUID in the SELECT, UPDATE and DELETE statements with values that were generated when you ran the insert script above.

 -- test select
SELECT * FROM testtable_int WHERE col1=4789341
GO
SELECT * FROM testtable_guid WHERE col1='CDD450DD-0048-4E3A-A074-7BF57D4C5C91'
GO
-- test insert
INSERT INTO testtable_int (col1, col2,  col3, col4) 
   VALUES (5000001,round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(5000001 AS VARCHAR))
GO
INSERT INTO testtable_guid (col1, col2,  col3, col4) 
   VALUES (newid(),round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(5000001 AS VARCHAR))
GO
-- test update
UPDATE testtable_int SET col2=232342,col3=232340,col4='TESTUPDATE'
WHERE col1=3023481
GO
UPDATE testtable_guid SET col2=232342,col3=232340,col4='TESTUPDATE'
WHERE col1='FA7B4737-70ED-49EA-BD26-19B737294C9D'
GO
-- test delete
DELETE FROM testtable_int WHERE col1=2789341
GO
DELETE FROM testtable_guid WHERE col1='EE053ACD-99C3-49C3-B11A-2D18F8248D29'
GO 

TableNameDML OperationCPU (ms)ReadsWritesDuration (ms)
testtable_intSELECT17114476257
testtable_guidSELECT18723975341
testtable_intINSERT0813
testtable_guidINSERT0812
testtable_intUPDATE01212
testtable_guidUPDATE01213
testtable_intDELETE01422
testtable_guidDELETE01423

Well this is a bit of a surprise.  Both indexes have very similar performance with only the SELECT performance of the integer datatype showing any noticeable edge over the GUID datatype.  Let’s rebuild each index using T-SQL below and see if anything changes from these initial results.

 alter index PK__testtabl__357D0D3E194A02AE on testtable_int rebuild;
alter index PK__testtabl__357D0D3E65B8D323 on testtable_guid rebuild;   

With these indexes rebuilt let’s again take a look at the storage and fragmentation of each index using the queries we used above.

TableNameIndexNameIndexTypeAvg_fragmentation_in_percent
testtable_intPK__testtabl__357D0D3E194A02AECLUSTERED INDEX0. 0315685036529268
testtable_guidPK__testtabl__357D0D3E65B8D323CLUSTERED INDEX0.01
TableNameIndexNameIndexSizeKB
testtable_intPK__testtabl__357D0D3E194A02AE177744
testtable_guidPK__testtabl__357D0D3E65B8D323237440

We can see from these results there is no longer any index fragmentation with the index on the GUID column.  Also notice that even though we recovered quite a bit of space from index on the guid column it is still using a fair bit more than the index on the integer column.  Let’s run our DML statements again and see if the performance has changed at all from our initial test.

TableNameDML OperationCPU (ms)ReadsWritesDuration (ms)
testtable_intSELECT083112
testtable_guidSELECT083115
testtable_intINSERT0812
testtable_guidINSERT01322
testtable_intUPDATE01212
testtable_guidUPDATE01213
testtable_intDELETE01422
testtable_guidDELETE01444

With the indexes rebuilt we now see almost no performance difference between the two datatypes.  Assuming you are running regular maintenance on your indexes this simple test shows that you achieve pretty much identical performance between the two datatypes.  One thing to also note is that the GUID column also gives you the added benefit of uniqueness across systems.  If any of your applications merge data generated from different systems then using a GUID column as your key allows you to not have to worry about resolving any duplicates when you merge the data.

Finally, as with any performance testing, you should always test with your own data to ensure you are seeing the expected results before making any changes in your production environment.

Next Steps

2 Comments

  1. Thank you Ben for your interesting quantitative analysis. My view of this may differ somewhat/ The issue is never with the insert statements those will always be similar due to the very slight difference in the amount of data being written. In your analysis, the 25% degradation of the single select, however, does seem to be significant. While 25% on a single table, I have not encountered many situations where we are doing selects from just one table for most enterprise situations. In these situations, the majority of reads are from rather complicated joins that are needed to aggregate the data for the business purpose at hand. When the impact of the multiple joins (typically 3 to 5 needed and in many cases, stored procedures have many of those combinations, you end up with an unacceptable penalty that is difficult to justify. There are 3 paradigms that should be observed in order. Performance 1st, Maintainability 2nd, Cost 3rd (because #1 and #2 directly impact #3 over time). This analysis has confirmed for me at least, that Guids should not be used as PK unless there is some other overriding purpose to them beyond just allowing to implement the Domain Driven Paradigm without critical thinking of the practical impact of that design.

  2. If privacy is a concern, do not use the NEWSEQUENTIALID function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.

Leave a Reply

Your email address will not be published. Required fields are marked *