Find and Remove Duplicate Rows from a SQL Server Table

By:   |   Comments (13)   |   Related: > Database Design


Problem

According to database design best practices, a SQL Server table should not contain duplicate rows. During the database design process primary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed or exceptions are possible (when these rules are bypassed knowingly). For example, when a staging table is used and data is loaded from different sources where duplicate rows are possible. When the loading process completes, table should be cleaned or clean data should be loaded to a permanent table, so after that the duplicates are no longer needed. Therefore, an issue concerning the removal of duplicates from the loading table arises. In this tutorial let's examine some ways to solve data de-duplication needs.

Solution

We will consider two cases in this tip:

  • The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed. 
  • The second case is that table does not have a primary key or any unique indexes and contains duplicate rows which should be removed.  Let's discuss these cases separately.

How to remove duplicate rows in a SQL Server table

Duplicate records in a SQL Server table can be a very serious issue.  With duplicate data it is possible for orders to be processed numerous times, have inaccurate results for reporting and more.  In SQL Server there are a number of ways to address duplicate records in a table based on the specific circumstances such as:

  • Table with Unique Index - For tables with a unique index, you have the opportunity to use the index to order identify the duplicate data then remove the duplicate records.  Identification can be performed with self-joins, ordering the data by the max value, using the RANK function or using NOT IN logic.
  • Table without a Unique Index - For tables without a unique index, it is a bit more challenging.  In this scenario, the ROW_NUMBER() function can be used with a common table expression (CTE) to sort the data then delete the subsequent duplicate records.

Check out the examples below to get real world examples on how to delete duplicate records from a table.

Removing duplicates rows from a SQL Server table with a unique index

Test Environment Setup

To accomplish our tasks, we need a test environment which we create with the following statement:

USE master
GO

CREATE DATABASE TestDB
GO

USE TestDB
GO

CREATE TABLE TableA
(
 ID INT NOT NULL IDENTITY(1,1),
 Value INT,
 CONSTRAINT PK_ID PRIMARY KEY(ID)  
)

Now let's insert data into our new table - 'TableA' with the following statement:

USE TestDB
GO

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT *
FROM TableA

SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value

As we can see in the result set below the values 3 and 5 exists in the 'Value' column more than once:

Test data is queried

Identify Duplicate Rows in a SQL Server Table

Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from table with a unique index is a bit easier than removing the rows from a table without it. First of all, we need to find duplicates. There are many different ways to do that. Let's investigate and compare some common ways. In the following queries below there are six solutions to find that duplicate values which should be deleted (leaving only one value):

----- Finding duplicate values in a table with a unique index
--Solution 1
SELECT a.* 
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID < MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL

--Solution 2
SELECT a.* 
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL

--Solution 3
SELECT a.*
FROM TableA a
INNER JOIN
(
 SELECT MAX(ID) AS ID, Value 
 FROM TableA
 GROUP BY Value 
 HAVING COUNT(Value) > 1
) b
ON a.ID < b.ID AND a.Value=b.Value

--Solution 4
SELECT a.* 
FROM TableA a 
WHERE ID < (SELECT MAX(ID) FROM TableA b WHERE a.Value=b.Value GROUP BY Value HAVING COUNT(*) > 1)

--Solution 5 
SELECT a.*
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b 
ON a.ID=b.ID
WHERE b.rnk > 1

--Solution 6 
SELECT * 
FROM TableA 
WHERE ID NOT IN (SELECT MAX(ID) 
                 FROM TableA 
                 GROUP BY Value)

As we can see the result for all cases is the same as shown in the screenshot below:

Different techniques to identify duplicate rows

Only rows with ID=3, 5, 6 need to be deleted. Looking at the execution plan we can see that latest - the most 'compact' solution ('Solution 6') has a highest cost (in our example there is a primary key on the 'ID' column, so 'NULL' values are not possible for that column, therefore 'NOT IN' will work without any problem), and the second has the lowest cost:

Execution plans for the duplicate row code

Deleting Duplicate Rows in a SQL Server Table

Now by using the following queries, let's delete duplicate values from the table. To simplify our process, we will use only the second, the fifth and the sixth queries:

USE TestDB
GO

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE t
FROM TableA t
WHERE ID IN ( SELECT a.ID FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
    WHERE a.ID=b.ID AND b.MaxValue IS NULL) 

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE a
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b 
ON a.ID=b.ID
WHERE b.rnk>1

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE FROM TableA 
WHERE ID NOT IN (SELECT MAX(ID) 
                 FROM TableA 
                 GROUP BY Value)

Deleting the data and looking into the execution plans again we see that the fastest is the first DELETE statement and the slowest is the last as expected:

Query plans for deleting the duplicate data

Removing duplicates from table without a unique index in ORACLE

As a means to help illustrate our final example in this tip, I want to explain some similar functionality in Oracle.  Removing duplicate rows from the table without a unique index is a little easier in Oracle than in SQL Server. There is a ROWID pseudo column in Oracle which returns the address of the row. It uniquely identifies the row in the table (usually in the database also, but in this case, there is an exception - if different tables store data in the same cluster they can have the same ROWID). The query below creates and inserts data into table in the Oracle database:

CREATE TABLE TableB (Value INT);

INSERT INTO TableB(Value) VALUES(1);
INSERT INTO TableB(Value) VALUES(2);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(4);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(5);

Now we are selecting the data and ROWID from the table:

SELECT ROWID, Value FROM TableB;

The result is below:

SELECT ROWID in Oracle to identify duplicates

Now using ROWID, we will easily remove duplicate rows from table:

DELETE TableB
WHERE  rowid not in (
                      SELECT MAX(rowid)
                      FROM  TableB
                      GROUP  BY Value
                    );

We can also remove duplicates using the code below:

DELETE from TableB o
WHERE  rowid < (
                      SELECT MAX(rowid)
                      FROM  TableB i
                      WHERE i.Value=o.Value
                      GROUP  BY Value
                    );

Removing duplicates from a SQL Server table without a unique index

Unlike Oracle, there is no ROWID in SQL Server, so to remove duplicates from the table without a unique index we need to do additional work for generating unique row identifiers:

USE TestDB
GO

CREATE TABLE TableB (Value INT)

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT * FROM TableB

; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (ORDER BY Value) AS RowID, Value
 FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID < (SELECT MAX(rowID) FROM TableBWithRowID i WHERE i.Value=o.Value GROUP BY Value)

SELECT * FROM TableB

In the code above, we are creating a table with duplicate rows. We are generating unique identifiers using the ROW_NUMBER() function and by using common table expression (CTE) we are deleting duplicates:

Removing duplicates from a SQL Server table without unique index

This code, however, can be replaced with more compact and optimal one:

USE TestDB
GO

--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (PARTITION BY Value ORDER BY Value) AS RowID, Value
  FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID > 1

SELECT * FROM TableB

Having said that, it is possible to identify the physical address of the row in SQL Server as well. Despite the fact that it is practically impossible to find official documentation about this feature, it can be used as an analog to ROWID pseudo column in Oracle. It is called %%physloc%% (since SQL Server 2008) and it is a virtual binary(8) column which shows the physical location of the row. As the value of %%physloc%% is unique for each row, we can use it as a row identifier while removing duplicate rows from a table without a unique index. Thus, we can remove duplicate rows from a table without a unique index in SQL Server like in Oracle as well as like in the case when the table has a unique index.

The first two queries below are the equivalent versions of removing duplicates in Oracle, the next two are queries for removing duplicates using %%physloc%% similar to the case of the table with a unique index, and in the last query, %%physloc%% is not used just for comparing performance of all of these options:

-- option 1
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE o
FROM
(SELECT %%physloc%% as RowID, value FROM TableB) o
WHERE o.RowID < (
                   SELECT MAX(%%physloc%%)
                   FROM TableB i
                   WHERE i.Value=o.Value
                   GROUP BY Value
                                  )

-----------
-- option 2
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE TableB
WHERE %%physloc%% not in (
                   SELECT MAX(%%physloc%%)
                   FROM TableB
                   GROUP BY Value
                                   )

-----------
-- option 3
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE b1
FROM 
(SELECT %%physloc%% as RowID, value FROM TableB) b1
INNER JOIN
(SELECT %%physloc%% as RowID, RANK() OVER(PARTITION BY Value ORDER BY %%physloc%% DESC) AS rnk FROM TableB ) b2 
ON b1.RowID=b2.RowID
WHERE b2.rnk>1

-----------
-- option 4
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE b1
FROM Tableb b1 
WHERE %%physloc%% < (SELECT MAX(%%physloc%%) 
                     FROM Tableb b2 
                     WHERE b1.Value=b2.Value 
                     GROUP BY Value 
                     HAVING COUNT(*) > 1
                    )

-----------
-- option 5
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

; WITH 
TableBWithRowID AS
(
   SELECT ROW_NUMBER() OVER (partition by Value ORDER BY Value) AS RowID, Value
   FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID > 1

Analyzing the Execution Plans, we can see that the first and the last queries are the fastest when compared to the overall batch times:

query performance

Hence, we can conclude that in general, using %%physloc%% does not improve the performance. While using this approach, it is very important to realize that this is an undocumented feature of SQL Server and, therefore, developers should be very careful.

There are other ways to remove duplicates which is not discussed in this tip. For example, we can store distinct rows in a temporary table, then delete all data from our table and after that insert distinct rows from temporary table to our permanent table. In this case DELETE and INSERT statements should be included in one transaction.

Conclusion

During our experience we face situations when we need to clean duplicate values from SQL Server tables. The duplicate values can be in the column which will be de-duplicated based on our requirements or the table can contain duplicate rows.  In either case we need to exclude the data to avoid data duplication in the database. In this tip we explained some techniques which hopefully will be helpful to solve these types of problems.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Sergey Gigoyan Sergey Gigoyan is a database professional with more than 10 years of experience, with a focus on database design, development, performance tuning, optimization, high availability, BI and DW design.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Thursday, November 11, 2021 - 10:13:55 AM - AliceTeacher Back To Top (89433)
"Remove duplicates" becomes much more complex (for me) when I have to consider both of these 2 items:
1. I want to decide what I consider a "duplicate". (Only check these 3 columns, not all of them, determines my defined "duplicate")
2. When a duplicate is found, I want to decide which of the 2 records to display. (The record that has a value in a certain column)



Wednesday, April 21, 2021 - 7:00:04 AM - Bui Tan Duoc Back To Top (88574)
Removing duplicates rows from a SQL Server table with a unique index. I have column ID (unique index) and 4 more data columns. All data columns are the same but ID. I want to keep one? What is the sql script? Thanks.

Wednesday, August 7, 2019 - 2:25:23 AM - Sergey Gigoyan Back To Top (81989)

Thank you for your comment Mario Kresnadi. The mentioned query is not a single query and cannot be executed separately. You have copied only the first part of the query. In the article, the full query is the following:

; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (ORDER BY Value) AS RowID, Value
 FROM TableB
)

DELETE o
FROM TableBWithRowID 0
WHERE RowID < (SELECT MAX(rowID) FROM TableBWithRowID i WHERE i.Value=o.Value GROUP BY Value)

Thanks,
Sergey


Monday, November 26, 2018 - 9:10:35 PM - Mario Kresnadi Back To Top (78331)

Sorry but I think it is not the syntax of the T-SQL query; check the incorrect syntax below, excerpted from your query:

SELECT * FROM TableB ; WITH TableBWithRowID AS ( SELECT ROW_NUMBER() OVER (ORDER BY Value) AS RowID, Value FROM TableB )


Tuesday, September 11, 2018 - 1:28:23 PM - Sanjay Yadav Back To Top (77508)

Thanks , very helpful to crack sql interview


Wednesday, November 9, 2016 - 4:28:20 AM - eyad Back To Top (43727)

delete from tbl where unqFld not in (

   select min(unqFld) from tbl group by dupFld having count = 1 -- not repeated, thus excluded

   union /*all*/ select min(unqFld) from tbl group by dupFld having count > 1) -- exlcude first occurences of repeated

-- simple but mosty a slow "not in" operator

 

 


Friday, October 14, 2016 - 4:54:22 PM - jeff_yao Back To Top (43567)

This is decades-long topic. :-)

Just to help other readers, AFAIK, the simplest way is as the following (sql server 2005+) using the same example in the tip

USE TestDB
GO

CREATE TABLE TableB (Value INT)

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT * FROM TableB

; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (partition by  Value ORDER BY Value) AS RowID, Value
 FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID > 1

SELECT * FROM TableB

Thursday, October 13, 2016 - 12:27:03 PM - Sergey Gigoyan Back To Top (43554)

Hello Marc,

Sure, it's also a good example, and as mentioned in the article there are other solutions to this problem, however  in our examples we are concentrating on removing duplicate records from the existing table and have not expanded on creating new tables (even the temporary ones). Certainly, the query which finds de-duplicated result is a bit simpler, but after that we need other T-SQL command(s), which leaves only the unique values or creates a new table moving those values to it. 

Thanks


Thursday, October 13, 2016 - 11:35:46 AM - Marc Jellinek Back To Top (43553)

 This can be done much simpler.  The following query will present a de-duplicated table:

 

CREATE TABLE [dbo].[DupTest] ([value] [int])

INSERT INTO [dbo].[DupTest] VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (2), (5), (9)

/* Any duplicate rows will have [instance] > 1 */

WITH CTE_Ranking AS

(   SELECT ROW_NUMBER() OVER (PARTITION BY [value] ORDER BY [value]) as [instance], 

                [value]

    FROM [dbo].[DupTest]

)

SELECT   *

FROM      CTE_Rankings

WHERE [instance] = 1

 

This will result in a de-duplicated result-set that can be copied into a new or existing table


Monday, October 10, 2016 - 12:00:41 PM - Greg Robidoux Back To Top (43527)

Hi David, the script has been fixed. The o alias was accidently removed, but has been put back into the script.

-Greg


Monday, October 10, 2016 - 8:23:17 AM - David Back To Top (43525)

Hello,

removing duplicates from a SQL Server table without unique index on MSSQL 2012 does not work.

I get error:

 

(8 row(s) affected)

(1 row(s) affected)

(8 row(s) affected)

(1 row(s) affected)

Msg 4104, Level 16, State 1, Line 9

The multi-part identifier "o.Value" could not be bound.

 

Can you help me?

 

Thank you David


Friday, October 7, 2016 - 3:10:31 PM - Sergey Gigoyan Back To Top (43510)

Hello Ola,

Thank you for the comment. In the article we are searching for the rows which should be deleted to eliminate duplicates. We have formulated our task  as only removing duplicates from "Value" column and do not define the criteria exactly which row from that duplicates should remain (in the real world examples such kind of criteria may exist, for example leave only the newest rows from duplicates). As we can see, in our example rows with ID =3, ID=7 have the same value in the "Value" field - "Value" =3 and for rows ID=5, ID=6 and ID=8 the "Value" is 5 . So, we need to leave only one row with "Value"=3 and "Value"=5 and delete others. In the example described in the article we leave rows with the highest ID from duplicates - "OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk". In your example you are using "OVER( PARTITION BY Value ORDER BY ID) AS DupRw", so you are going to leave the rows with the lowest ID from duplicates(ID=3 and ID=5) (by default, the values are being sorted by ascending order) and delete rows with ID=7,6,8. Your example also solves the problem and you will have distinct values in the "Value" column of  the TableA. 

Thanks

Sergey

 


Friday, October 7, 2016 - 10:18:39 AM - Ola Back To Top (43508)

 Hello thanks for the tip, but am a bit lost when you query to find the duplicate records comes up with ID 3,5,6. Please check the code below:

WITH CTE_TGIF AS

(SELECT ID,Value, 

D NSE_RANK() OVER( PARTITION BY Value ORDER BY ID) AS DupRw

FROM TableA

)

SELET *

FROM CTE_TGIF

WHERE DupRw > 1 

 

Any explanation(s) on the ID to be deleted for consistency sake.

 

thanks

Ola

 

 

 















get free sql tips
agree to terms