Learn more about SQL Server tools



solving sql server problems for millions of dbas and developers since 2006 join MSSQLTips for free SQL Server tips













































   Got a SQL tip?
            We want to know!

Removing Duplicates Rows with SSIS Sort Transformation

MSSQLTips author Brady Upton By:   |   Read Comments (5)   |   Related Tips: More > Integration Services Data Flow Transformations
Problem

I have multiple duplicate records in my SQL Server database. What is a quick and easy way to remove them using SSIS?

Solution

There are multiple ways to remove duplicate records in SQL Server. In this tip, I'll use the SSIS Sort Transformation to remove records and show you how easy it can be.

The SSIS Sort Transformation task is useful when you need to sort data into a certain sort order. You can compare it to the ORDER BY clause in a SELECT statement. Books Online explains it as:

"The Sort transformation sorts input data in ascending or descending order and copies the sorted data to the transformation output. You can apply multiple sorts to an input; each sort is identified by a numeral that determines the sort order. The column with the lowest number is sorted first, the sort column with the second lowest number is sorted next, and so on"

SSIS Sort Task in Action

First, open Visual Studio (or Business Intelligence Dev Studio if you're using pre SQL Server 2012) and create an SSIS project. Next, we can go ahead and make a connection to our database. Right click Connection Managers in Solution Explorer and choose New Connection Manager:

There are multiple ways to remove duplicate records in SQL Server

Choose your Connection Manager type. In this example, we'll use OLEDB. Next, configure the Connection Manager to point to your dataset. In this example, I'll use localhost and my Dev database:

Choose your Connection Manager type

Test the connection and click OK. Next, drag a Data Flow task from the SSIS toolbox onto the design screen:

drag a Data Flow task from the SSIS toolbox onto the design screen

Right click the Data Flow task and choose Edit. You are now inside the data flow task. This is where all the action happens. Drag an OLEDB source task from the SSIS toolbox to the design screen:

Right click the Data Flow task and choose Edit

Right click the OLEDB task and choose Edit. This screen is where we will define the connection manager we created earlier. Under OLEDB connection manager choose the connection you created. Leave data access mode as Table or view. Change the name of the table or the view to the table that has duplicate data that needs to be removed. In this example, I'll use a table named Teams:

Right click the OLEDB task and choose Edit

To preview the data click Preview. In my example, you can see I have duplicates in the Team, City and State columns:

you can see I have duplicates in the Team, City and State columns

Click OK to close the OLEDB Source task. Drag the Sort Transformation task onto the design screen. Connect the OLEDB Source task to the Sort task:

Connect the OLEDB Source task to the Sort task

Right click the Sort task and choose Edit. Here is where we can sort our data. Let's say I want to sort my data by State. Under Available Input Columns, I'll choose State:

Right click the Sort task and choose Edit

Click OK. Drag the Derived Column task from the SSIS toolbox onto the design screen. Connect the Sort task to the Derived Column task:

Drag the Derived Column task from the SSIS toolbox onto the design screen

Right click on the precedence constraint between Sort and Derived column and click Enable Date Viewer. This will allow us to view the data as it passes through the constraint:

Right click on the precedence constraint between Sort and Derived column and click Enable Date Viewer

Let's view our data sorted by State. Click the play button on the toolbar to debug:

Click the play button on the toolbar to debug

Tada! You can see the data has been sorted by State:

You can see the data has been sorted by State

Removing Duplicate Rows Using SSIS Sort

But wait....what does this have to do with removing duplicates? Close the Data Viewer and click the stop button on the toolbar to stop debugging. Right click the Sort task again and you'll notice down at the bottom, "Remove rows with duplicate values". I know, I know, you're thinking no way that it's this easy. Click the remove rows option and choose OK:

Click the remove rows option and choose OK

Click the play button on the toolbar again to view the results. On the design screen, you can see that I passed 20 rows to the sort column but the sort column only passed 11 rows to the next task. This means the transformation removed 9 duplicates based on the column state:

Click the play button on the toolbar again to view the results

The package worked the way I designed it but I don't want to remove State duplicates. I want to remove Team, City and State duplicates. Back in design view, right click the Sort task and choose Edit. Add Team and City to the input columns and click OK:

The package worked the way I designed it but I don't want to remove State duplicates

Now, click play one more time...

You can now see I have "Distinct" Team, City and State columns:

You can now see I have

Next Steps
  • To move the new dataset to a location just add a destination task in place of the derived column task.
  • View more SSIS Data Flow Transformation tips courtesy of MSSQLTips.com here


Last Update: 9/20/2013


About the author
MSSQLTips author Brady Upton
Brady Upton is a Database Administrator and SharePoint superstar in Nashville, TN.

View all my tips


print tip Print  
Become a paid author





join MSSQLTips for free SQL Server tips     



Learn more about SQL Server tools
Comments and Feedback:
Saturday, September 21, 2013 - 3:40:30 PM - Jesse Duncan Read The Tip

That was great thanks for the in depth steps.

 


Monday, January 20, 2014 - 8:37:52 AM - Koen Verbeeck Read The Tip

I believe it is important to notice that the sort component is a blocking transformation: it needs to load all of the source rows into memory before it even outputs one row. This package is absolutely not scalable and will eat available memory for large data sets until it comes to a grinding halt when it starts swapping out to disk.

On topic: very nice Ktutorial :)


Friday, May 09, 2014 - 4:12:27 AM - back Read The Tip

 

hi bro if am have  a 2 rows 
such as

:id:name     :L_name:

 1  :  calos  :  NULL

 1  : NULL   :  LA

 

and i want to my data in ssis is  

:id:name     :L_name:

 1  :  calos  :  LA
 can you help me 
 

 


Monday, June 02, 2014 - 11:58:36 AM - bhuvaneswara reddy Read The Tip

Hi sir,

 your sended only eliminate the duplicate values, but i want eliminate duplicated values also going another table. please send the information how to do that.


Monday, July 14, 2014 - 4:48:03 AM - VanteGud Read The Tip

Hello Brady,

Thanks for a nice a clean post.

Could you clarify something for me: If I have a table with, say, three columns and I do a "remove duplicates" on 'Key' And 'Value1' columns and lets say I have the following values in my columns:

Key Value1 Value2
1 Cat White
1 Cat Black
2 Parrot Red
3 Owl Blue

What would be my output of Value2 (Key=1)? White or Black? And can I add a sorting or something to control which one I get? (Time would be a good example of a needed sorting).

 

Thanks in advance,



Post a Comment or Question

Keep it clean and stay on the subject or we may delete your comment.
Your email address is not published. Required fields are marked with an asterisk (*)

*Name   *Email Notify for updates



Comments
Get free SQL tips:

*Enter Code refresh code


 
Sponsor Information







Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved
privacy | disclaimer | copyright | advertise | about
authors | contribute | feedback | giveaways | free t-shirt | user groups | community | events | first timer?
Some names and products listed are the registered trademarks of their respective owners.