By: Ben Snaidero | Last Updated: 2017-09-29 | Comments (10) | Database Design
Over the years I have read quite a few articles about which datatypes should be used for primary keys. At some point the conversation ends up with one side thinking an integer is a better option and the other side leaning more towards using a GUID.
In this tip we will aim to identify any performance differences between these two datatypes. We will also look at a couple other index characteristics associated with each type to see if there are any other benefits that might outweigh any performance differences.
As many of my tips deal with SQL Server performance we will test the performance of these two datatypes by running through a simple test case to see if we can measure any significant performance differences. We will setup two almost identical tables with the only variance being the datatype of the primary key column.
Below is the T-SQL to setup and load these tables.
CREATE TABLE testtable_int ([col1] [int] NOT NULL primary key clustered, [col2] [int] NULL, [col3] [int] NULL, [col4] [varchar](50) NULL); CREATE TABLE testtable_guid ([col1] [uniqueidentifier] NOT NULL primary key clustered, [col2] [int] NULL, [col3] [int] NULL, [col4] [varchar](50) NULL); DECLARE @val INT SELECT @val=1 WHILE @val < 5000000 BEGIN INSERT INTO testtable_int (col1, col2, col3, col4) VALUES (@val,round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(@val AS VARCHAR)) INSERT INTO testtable_guid (col1, col2, col3, col4) VALUES (newid(),round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(@val AS VARCHAR)) SELECT @[email protected]+1 END GO
Before we take a look at the performance let's first take a look at some of the other index characteristics, disk space usage and fragmentation. Given that an integer requires 4 bytes and a GUID uses 16 bytes I would expect that the index on the GUID column would be much larger. Using the query below let's take a look at just how much more disk space it uses.
SELECT OBJECT_NAME(i.[object_id]) AS TableName, i.[name] AS IndexName, SUM(s.[used_page_count]) * 8 AS IndexSizeKB FROM sys.dm_db_partition_stats AS s INNER JOIN sys.indexes AS i ON s.[object_id] = i.[object_id] AND s.[index_id] = i.[index_id] WHERE OBJECT_NAME(i.[object_id]) like '%testtable%' GROUP BY i.[name],i.[object_id];
As would be expected the index on the GUID column is using quite a bit more space. But one thing we should consider when looking at this is whether all this extra space used is due to the fact that it's a 4 byte datatype vs 16 byte datatype or if there is there some fragmentation that is contributing to the extra disk space being used. I would suspect that since the GUID values are random there would be quite a bit more fragmentation with this datatype versus the integer which is ever increasing. Let's take a look at the fragmentation of each index using the T-SQL below.
SELECT OBJECT_NAME(ind.OBJECT_ID) AS TableName, ind.name AS IndexName, indexstats.index_type_desc AS IndexType, indexstats.avg_fragmentation_in_percent FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL) indexstats INNER JOIN sys.indexes ind ON ind.object_id = indexstats.object_id AND ind.index_id = indexstats.index_id WHERE OBJECT_NAME(ind.OBJECT_ID) like '%testtable%' ORDER BY indexstats.avg_fragmentation_in_percent DESC
The above output confirms our assumption, the index on the GUID column does have quite a bit more fragmentation. Before we fix this let's first run through some simple DML statements to see if there are any performance differences with the indexes in this state. As always we can use SQL Profiler to capture the performance metrics for each statement. For my test I ran these queries 10 times restoring a baseline backup of the original tables each time in order to rule out any anomalies. I also ran a DBCC DROPCLEANBUFFERS before each execution to simulate the worst case scenario of the data having to be read from disk. Note that if you use the T-SQL below you'll need to update the GUID in the SELECT, UPDATE and DELETE statements with values that were generated when you ran the insert script above.
-- test select SELECT * FROM testtable_int WHERE col1=4789341 GO SELECT * FROM testtable_guid WHERE col1='CDD450DD-0048-4E3A-A074-7BF57D4C5C91' GO -- test insert INSERT INTO testtable_int (col1, col2, col3, col4) VALUES (5000001,round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(5000001 AS VARCHAR)) GO INSERT INTO testtable_guid (col1, col2, col3, col4) VALUES (newid(),round(rand()*100000,0),round(rand()*100000,0),'TEST' + CAST(5000001 AS VARCHAR)) GO -- test update UPDATE testtable_int SET col2=232342,col3=232340,col4='TESTUPDATE' WHERE col1=3023481 GO UPDATE testtable_guid SET col2=232342,col3=232340,col4='TESTUPDATE' WHERE col1='FA7B4737-70ED-49EA-BD26-19B737294C9D' GO -- test delete DELETE FROM testtable_int WHERE col1=2789341 GO DELETE FROM testtable_guid WHERE col1='EE053ACD-99C3-49C3-B11A-2D18F8248D29' GO
|TableName||DML Operation||CPU (ms)||Reads||Writes||Duration (ms)|
Well this is a bit of a surprise. Both indexes have very similar performance with only the SELECT performance of the integer datatype showing any noticeable edge over the GUID datatype. Let's rebuild each index using T-SQL below and see if anything changes from these initial results.
alter index PK__testtabl__357D0D3E194A02AE on testtable_int rebuild; alter index PK__testtabl__357D0D3E65B8D323 on testtable_guid rebuild;
With these indexes rebuilt let's again take a look at the storage and fragmentation of each index using the queries we used above.
|testtable_int||PK__testtabl__357D0D3E194A02AE||CLUSTERED INDEX||0. 0315685036529268|
We can see from these results there is no longer any index fragmentation with the index on the GUID column. Also notice that even though we recovered quite a bit of space from index on the guid column it is still using a fair bit more than the index on the integer column. Let's run our DML statements again and see if the performance has changed at all from our initial test.
|TableName||DML Operation||CPU (ms)||Reads||Writes||Duration (ms)|
With the indexes rebuilt we now see almost no performance difference between the two datatypes. Assuming you are running regular maintenance on your indexes this simple test shows that you achieve pretty much identical performance between the two datatypes. One thing to also note is that the GUID column also gives you the added benefit of uniqueness across systems. If any of your applications merge data generated from different systems then using a GUID column as your key allows you to not have to worry about resolving any duplicates when you merge the data.
Finally, as with any performance testing, you should always test with your own data to ensure you are seeing the expected results before making any changes in your production environment.
- Extend testing further to also compare using sequential GUIDs and the bigint datatype
- Read other tips on indexing
- Read other tips on performance tuning
- Here is a link to a related post.
Last Updated: 2017-09-29
About the author
View all my tips