Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

Next Webcast - The Cloud won't fix that (click for more info)
 

Find and Remove Duplicate Rows from a SQL Server Table


By:   |   Read Comments (9)   |   Related Tips: More > Database Design

Problem

According to database design best practices, a SQL Server table should not contain duplicate rows. During the database design process primary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed or exceptions are possible (when these rules are bypassed knowingly). For example, when a staging table is used and data is loaded from different sources where duplicate rows are possible. When the loading process completes, table should be cleaned or clean data should be loaded to a permanent table, so after that duplicates are no longer needed. Therefore, an issue concerning the removal of duplicates from the loading table arises. In this tip let's examine some ways to solve data de-duplication needs.

Solution

We will consider two cases in this tip:

  • The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed. 
  • The second case is that table does not have a primary key or any unique indexes and contains duplicate rows which should be removed.  Let's discuss these cases separately.

How to remove duplicate rows in a SQL Server table

Duplicate records in a SQL Server table can be a very serious issue.  With duplicate data it is possible for orders to be processed numerous times, have inaccurate results for reporting and more.  In SQL Server there are a number of ways to address duplicate records in a table based on the specific circumstances such as:

  • Table with Unique Index - For tables with a unique index, you have the opportunity to use the index to order identify the duplicate data then remove the duplicate records.  Indentification can be performed with self-joins, ordering the data by the max value, using the RANK function or using NOT IN logic.
  • Table without a Unique Index - For tables without a unique index, it is a bit more challenging.  In this scenario, the ROW_NUMBER() function can be used with a common table expression (CTE) to sort the data then delete the subsequent duplicate records.

Check out the examples below to get real world examples on how to delete duplicate records from a table.

Removing duplicates rows from a SQL Server table with a unique index

Test Environment Setup

To accomplish our tasks, we need a test environment:

USE master
GO

CREATE DATABASE TestDB
GO

USE TestDB
GO

CREATE TABLE TableA
(
 ID INT NOT NULL IDENTITY(1,1),
 Value INT,
 CONSTRAINT PK_ID PRIMARY KEY(ID)  
)

Now let's insert data into 'TableA':

USE TestDB
GO

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT *
FROM TableA

SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value

As we can see the values 3 and 5 exists in the 'Value' column more than once:

Test data is queried

Identify Duplicate Rows in a SQL Server Table

Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from table with a unique index is a bit easier than removing the rows from a table without it. First of all, we need to find duplicates. There are many different ways to do that. Let's investigate and compare some common ways. In the code below there are six solutions to find that duplicate values which should be deleted (leaving only one value):

----- Finding duplicate values in a table with a unique index
--Solution 1
SELECT a.* 
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID < MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL

--Solution 2
SELECT a.* 
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL

--Solution 3
SELECT a.*
FROM 
TableA a
INNER JOIN
(
 SELECT MAX(ID) AS ID, Value 
 FROM TableA
 GROUP BY Value 
 HAVING COUNT(Value) > 1
) b
ON a.ID < b.ID AND a.Value=b.Value

--Solution 4
SELECT a.* 
FROM TableA a 
WHERE ID < (SELECT MAX(ID) FROM TableA b WHERE a.Value=b.Value GROUP BY Value HAVING COUNT(*) > 1)

--Solution 5 
SELECT a.*
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b 
ON a.ID=b.ID
WHERE b.rnk > 1

--Solution 6 
SELECT * FROM TableA 
WHERE ID NOT IN (SELECT MAX(ID) 
                 FROM TableA 
        GROUP BY Value)

As we can see the result for all cases is the same:

Different techniques to identify duplicate rows

Only rows with ID=3, 5, 6 need to be deleted. Looking at the execution plan we can see that latest - the most 'compact' solution ('Solution 6') has a highest cost (in our example there is a primary key on the 'ID' column, so 'NULL' values are not possible for that column, therefore 'NOT IN' will work without any problem), and the second has the lowest cost:

Execution plans for the duplicate row code

Deleting Duplicate Rows in a SQL Server Table

Now by using these queries, let's delete duplicate values from the table. To simplify our process, we will use only the second, the fifth and the sixth queries:

USE TestDB
GO

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE t
FROM TableA t
WHERE ID IN ( SELECT a.ID FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
    WHERE a.ID=b.ID AND b.MaxValue IS NULL) 

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE a
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b 
ON a.ID=b.ID
WHERE b.rnk>1

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE FROM TableA 
WHERE ID NOT IN (SELECT MAX(ID) 
                 FROM TableA 
     GROUP BY Value)

Deleting the data and looking into the execution plans again we see that the fastest is the first DELETE command and the slowest is the last as expected:

Query plans for deleting the duplicate data

Removing duplicates from table without a unique index in ORACLE

As a means to help illustrate our final example in this tip, I want to explain some similar functionality in Oracle.  Removing duplicate rows from the table without a unique index is a little easier in Oracle than in SQL Server. There is a ROWID pseudo column in Oracle which returns the address of the row. It uniquely identifies the row in the table (usually in the database also, but in this case, there is an exception - if different tables store data in the same cluster they can have the same ROWID). The query below creates and inserts data into table in the Oracle database:

CREATE TABLE TableB (Value INT);

INSERT INTO TableB(Value) VALUES(1);
INSERT INTO TableB(Value) VALUES(2);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(4);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(5);

Now we are selecting the data and ROWID from the table:

SELECT ROWID, Value FROM TableB;

The result is below:

SELECT ROWID in Oracle to identify duplicates

Now using ROWID, we will easily remove duplicate rows from table:

DELETE TableB
WHERE  rowid not in (
                      SELECT MAX(rowid)
                      FROM  TableB
                      GROUP  BY Value
                    );

We can also remove duplicates using the code below:

DELETE from TableB o
WHERE  rowid < (
                      SELECT MAX(rowid)
                      FROM  TableB i
                      WHERE i.Value=o.Value
                      GROUP  BY Value
                    );

Removing duplicates from a SQL Server table without a unique index

In SQL Server there is no equivalent to Oracle's ROWID, so to remove duplicates from the table without a unique index we need to do additional work for generating unique row identifiers:

USE TestDB
GO

CREATE TABLE TableB (Value INT)

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT * FROM TableB

; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (ORDER BY Value) AS RowID, Value
 FROM TableB
)

DELETE o
FROM TableBWithRowID 0
WHERE RowID < (SELECT MAX(rowID) FROM TableBWithRowID i WHERE i.Value=o.Value GROUP BY Value)

SELECT * FROM TableB

In the code above, we are creating a table with duplicate rows. We are generating unique identifiers using the ROW_NUMBER() function and by using common table expression (CTE) we are deleting duplicates:

Removing duplicates from a SQL Server table without unique index

There are other ways to remove duplicates which is not discussed in this tip. For example, we can store distinct rows in a temporary table, then delete all data from our table and after that insert distinct rows from temporary table to our permanent table. In this case DELETE and INSERT statements should be included in one transaction.

Conclusion

During our experience we face situations when we need to clean duplicate values from SQL Server tables. The duplicate values can be in the column which will be de-duplicated based on our requirements or the table can contain duplicate rows.  In either case we need to exclude the data to avoid data duplication in the database. In this tip we explained some techniques which hopefully will be helpful to solve these types of problems.

Next Steps


Last Update:


next webcast button


next tip button



About the author
MSSQLTips author Sergey Gigoyan Sergey Gigoyan is a database professional with more than 6 years of experience, with a focus on database design, development, performance tuning, optimization, high availability, BI and DW design.

View all my tips





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Tuesday, September 11, 2018 - 1:28:23 PM - Sanjay Yadav Back To Top

Thanks , very helpful to crack sql interview


Wednesday, November 09, 2016 - 4:28:20 AM - eyad Back To Top

delete from tbl where unqFld not in (

   select min(unqFld) from tbl group by dupFld having count = 1 -- not repeated, thus excluded

   union /*all*/ select min(unqFld) from tbl group by dupFld having count > 1) -- exlcude first occurences of repeated

-- simple but mosty a slow "not in" operator

 

 


Friday, October 14, 2016 - 4:54:22 PM - jeff_yao Back To Top

This is decades-long topic. :-)

Just to help other readers, AFAIK, the simplest way is as the following (sql server 2005+) using the same example in the tip

USE TestDB
GO

CREATE TABLE TableB (Value INT)

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT * FROM TableB

; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (partition by  Value ORDER BY Value) AS RowID, Value
 FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID > 1

SELECT * FROM TableB

Thursday, October 13, 2016 - 12:27:03 PM - Sergey Gigoyan Back To Top

Hello Marc,

Sure, it's also a good example, and as mentioned in the article there are other solutions to this problem, however  in our examples we are concentrating on removing duplicate records from the existing table and have not expanded on creating new tables (even the temporary ones). Certainly, the query which finds de-duplicated result is a bit simpler, but after that we need other T-SQL command(s), which leaves only the unique values or creates a new table moving those values to it. 

Thanks


Thursday, October 13, 2016 - 11:35:46 AM - Marc Jellinek Back To Top

 This can be done much simpler.  The following query will present a de-duplicated table:

 

CREATE TABLE [dbo].[DupTest] ([value] [int])

INSERT INTO [dbo].[DupTest] VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (2), (5), (9)

/* Any duplicate rows will have [instance] > 1 */

WITH CTE_Ranking AS

(   SELECT ROW_NUMBER() OVER (PARTITION BY [value] ORDER BY [value]) as [instance], 

                [value]

    FROM [dbo].[DupTest]

)

SELECT   *

FROM      CTE_Rankings

WHERE [instance] = 1

 

This will result in a de-duplicated result-set that can be copied into a new or existing table


Monday, October 10, 2016 - 12:00:41 PM - Greg Robidoux Back To Top

Hi David, the script has been fixed. The o alias was accidently removed, but has been put back into the script.

-Greg


Monday, October 10, 2016 - 8:23:17 AM - David Back To Top

Hello,

removing duplicates from a SQL Server table without unique index on MSSQL 2012 does not work.

I get error:

 

(8 row(s) affected)

(1 row(s) affected)

(8 row(s) affected)

(1 row(s) affected)

Msg 4104, Level 16, State 1, Line 9

The multi-part identifier "o.Value" could not be bound.

 

Can you help me?

 

Thank you David


Friday, October 07, 2016 - 3:10:31 PM - Sergey Gigoyan Back To Top

Hello Ola,

Thank you for the comment. In the article we are searching for the rows which should be deleted to eliminate duplicates. We have formulated our task  as only removing duplicates from "Value" column and do not define the criteria exactly which row from that duplicates should remain (in the real world examples such kind of criteria may exist, for example leave only the newest rows from duplicates). As we can see, in our example rows with ID =3, ID=7 have the same value in the "Value" field - "Value" =3 and for rows ID=5, ID=6 and ID=8 the "Value" is 5 . So, we need to leave only one row with "Value"=3 and "Value"=5 and delete others. In the example described in the article we leave rows with the highest ID from duplicates - "OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk". In your example you are using "OVER( PARTITION BY Value ORDER BY ID) AS DupRw", so you are going to leave the rows with the lowest ID from duplicates(ID=3 and ID=5) (by default, the values are being sorted by ascending order) and delete rows with ID=7,6,8. Your example also solves the problem and you will have distinct values in the "Value" column of  the TableA. 

Thanks

Sergey

 


Friday, October 07, 2016 - 10:18:39 AM - Ola Back To Top

 Hello thanks for the tip, but am a bit lost when you query to find the duplicate records comes up with ID 3,5,6. Please check the code below:

WITH CTE_TGIF AS

(SELECT ID,Value, 

D NSE_RANK() OVER( PARTITION BY Value ORDER BY ID) AS DupRw

FROM TableA

)

SELET *

FROM CTE_TGIF

WHERE DupRw > 1 

 

Any explanation(s) on the ID to be deleted for consistency sake.

 

thanks

Ola

 

 

 


Learn more about SQL Server tools