Removing Duplicates from Strings in SQL Server
By: Aaron Bertrand | Comments (15) | Related: More > TSQL
I have seen many systems that accept or store multi-valued strings in parameters, variables, or columns, with no process set up for de-duplication. Typically these are comma-separated; but really, the delimiter isn't important, the problem is that there are multiple facts embedded in a single value.
Let's say we have a string like this:
And we want to turn it into this (distinct values ordered by original appearance):
Or this (distinct values ordered alphabetically):
Assuming that we can't fix the application to remove duplicates before SQL Server ever sees them, and we can't fix the schema so that the data can be separated and normalized, how would we accomplish this cleanup in Transact-SQL?
In this tip, I will show you my approach to this kind of problem, which (if the application can't be fixed and the data can't be normalized) will almost always try to lean toward a set-based solution. The typical response to such a problem will often point toward some kind of looping strategy; SQL Server is not very efficient with loops.
First, we need a Numbers table in order to support string splitting. There are multiple approaches to breaking strings apart, but the Numbers table is my favorite, because it is simple to implement, has a small disk/memory footprint, performs well enough, and is more flexible than some of the other common approaches. One way to create a numbers table is as follows:
DECLARE @UpperLimit INT = 1000000; ;WITH n(rn) AS ( SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id]) FROM sys.all_columns AS s1 CROSS JOIN sys.all_columns AS s2 ) SELECT [Number] = rn INTO dbo.Numbers FROM n WHERE rn <= @UpperLimit; CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]) -- WITH (DATA_COMPRESSION = PAGE);
This will create a Numbers table with 1,000,000 rows, which by default will take about 17MB (with page compression, it will only require about 11MB). You may need more or fewer numbers depending on your usage and requirements; for this tip, you probably really only need 4,000 or 8,000 numbers to deal with the most common maximum string lengths.
As an aside, I typically create this table in a Utility database, so that a single copy can be used centrally from any other database. You can use synonyms to point to it so that your code does not need to use three-part names.
Now with the Numbers table in place, you can create a function that uses the table to break strings apart:
CREATE FUNCTION dbo.SplitString ( @List NVARCHAR(MAX), @Delim NVARCHAR(32) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN ( SELECT rn, vn = ROW_NUMBER() OVER (PARTITION BY [Value] ORDER BY rn), [Value] FROM ( SELECT rn = ROW_NUMBER() OVER (ORDER BY CHARINDEX(@Delim, @List + @Delim)), [Value] = LTRIM(RTRIM(SUBSTRING(@List, [Number], CHARINDEX(@Delim, @List + @Delim, [Number]) - [Number]))) FROM dbo.Numbers WHERE Number <= LEN(@List) AND SUBSTRING(@Delim + @List, [Number], LEN(@Delim)) = @Delim ) AS x );
As a quick test, we can see how this splitting function outputs our original string (and how we can use the rn and vn output columns to sort and remove duplicates, respectively):
DECLARE @List NVARCHAR(MAX), @Delim NVARCHAR(32); SELECT @List = N'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta', @Delim = N'/'; SELECT rn, vn, [Value] FROM dbo.SplitString(@List, @Delim);
rn vn Value ------ ------ ----- 2 1 Alpha 7 2 Alpha 1 1 Bravo 3 2 Bravo 6 3 Bravo 5 1 Delta 8 2 Delta 4 1 Tango --^ any vn <> 1 can be filtered out as a duplicate --^ row number can be used to preserve original order
Finally, we need a function that will put the strings back together, using FOR XML PATH:
CREATE FUNCTION dbo.ReassembleString ( @List NVARCHAR(MAX), @Delim NVARCHAR(32), @Sort NVARCHAR(32) ) RETURNS NVARCHAR(MAX) WITH SCHEMABINDING AS BEGIN RETURN ( SELECT newval = STUFF(( SELECT @Delim + x.[Value] FROM dbo.SplitString(@List, @Delim) AS x WHERE (x.vn = 1) -- filter out duplicates ORDER BY CASE @Sort WHEN N'OriginalOrder' THEN CONVERT(INT, x.rn) WHEN N'Alphabetical' THEN CONVERT(NVARCHAR(4000), x.value) ELSE CONVERT(SQL_VARIANT, NULL) END FOR XML PATH, TYPE).value(N'.',N'nvarchar(max)'),1,1,N'') ); END
With these in place, removing duplicates from our original string and reassembling it in the desired order becomes quite trivial:
DECLARE @List NVARCHAR(MAX), @Delim NVARCHAR(32); SELECT @List = N'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta', @Delim = N'/'; SELECT OriginalOrder = dbo.ReassembleString(@List, @Delim, N'OriginalOrder'), Alphabetical = dbo.ReassembleString(@List, @Delim, N'Alphabetical');
OriginalOrder Alphabetical ----------------------- ----------------------- Bravo/Alpha/Tango/Delta Alpha/Bravo/Delta/Tango
Now, this is not going to be very efficient at scale, but should be quite acceptable for smaller sets, or one-time mass updates - certainly fast compared to looping constructs you may be using now.
- Implement a Numbers table in a central utility database, and create the two functions to assist with string splitting and assembling.
- See these related tips and other resources:
About the author
View all my tips