Learn more about SQL Server tools

mssqltips logo
giveaway
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Removing Duplicates from Strings in SQL Server


By:   |   Read Comments (9)   |   Related Tips: More > T-SQL


Free MSSQLTips Webcast Today >> Optimize SQL Server Performance


Problem

I have seen many systems that accept or store multi-valued strings in parameters, variables, or columns, with no process set up for de-duplication. Typically these are comma-separated; but really, the delimiter isn't important, the problem is that there are multiple facts embedded in a single value.

Let's say we have a string like this:

 Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta

And we want to turn it into this (distinct values ordered by original appearance):

 Bravo/Alpha/Tango/Delta

Or this (distinct values ordered alphabetically):

 Alpha/Bravo/Delta/Tango

Assuming that we can't fix the application to remove duplicates before SQL Server ever sees them, and we can't fix the schema so that the data can be separated and normalized, how would we accomplish this cleanup in Transact-SQL?

Solution

In this tip, I will show you my approach to this kind of problem, which (if the application can't be fixed and the data can't be normalized) will almost always try to lean toward a set-based solution. The typical response to such a problem will often point toward some kind of looping strategy; SQL Server is not very efficient with loops.

First, we need a Numbers table in order to support string splitting. There are multiple approaches to breaking strings apart, but the Numbers table is my favorite, because it is simple to implement, has a small disk/memory footprint, performs well enough, and is more flexible than some of the other common approaches. One way to create a numbers table is as follows:

DECLARE @UpperLimit INT = 1000000;

;WITH n(rn) AS
(
  SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id])
  FROM sys.all_columns AS s1
  CROSS JOIN sys.all_columns AS s2
)
SELECT [Number] = rn
INTO dbo.Numbers FROM n
WHERE rn <= @UpperLimit;

CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number])
-- WITH (DATA_COMPRESSION = PAGE);

This will create a Numbers table with 1,000,000 rows, which by default will take about 17MB (with page compression, it will only require about 11MB). You may need more or fewer numbers depending on your usage and requirements; for this tip, you probably really only need 4,000 or 8,000 numbers to deal with the most common maximum string lengths.

As an aside, I typically create this table in a Utility database, so that a single copy can be used centrally from any other database. You can use synonyms to point to it so that your code does not need to use three-part names.

Now with the Numbers table in place, you can create a function that uses the table to break strings apart:

CREATE FUNCTION dbo.SplitString
(
  @List  NVARCHAR(MAX),
  @Delim NVARCHAR(32)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
  RETURN
  (
    SELECT 
      rn, 
      vn = ROW_NUMBER() OVER (PARTITION BY [Value] ORDER BY rn), 
      [Value]
    FROM 
    ( 
      SELECT 
        rn = ROW_NUMBER() OVER (ORDER BY CHARINDEX(@Delim, @List + @Delim)),
        [Value] = LTRIM(RTRIM(SUBSTRING(@List, [Number],
                  CHARINDEX(@Delim, @List + @Delim, [Number]) - [Number])))
      FROM dbo.Numbers
      WHERE Number <= LEN(@List)
      AND SUBSTRING(@Delim + @List, [Number], LEN(@Delim)) = @Delim
    ) AS x
  );

As a quick test, we can see how this splitting function outputs our original string (and how we can use the rn and vn output columns to sort and remove duplicates, respectively):

DECLARE @List NVARCHAR(MAX), @Delim NVARCHAR(32);

SELECT @List  = N'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta', 
       @Delim = N'/';

SELECT rn, vn, [Value] FROM dbo.SplitString(@List, @Delim);

Results:

  rn      vn      Value
  ------  ------  -----
  2       1       Alpha
  7       2       Alpha
  1       1       Bravo
  3       2       Bravo
  6       3       Bravo
  5       1       Delta
  8       2       Delta
  4       1       Tango
  
--^ any vn <> 1 can be filtered out as a duplicate
        
--^ row number can be used to preserve original order

Finally, we need a function that will put the strings back together, using FOR XML PATH:

CREATE FUNCTION dbo.ReassembleString
(
  @List  NVARCHAR(MAX),
  @Delim NVARCHAR(32),
  @Sort  NVARCHAR(32)
)
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN
  RETURN 
  ( 
    SELECT newval = STUFF((
     SELECT @Delim + x.[Value] 
     FROM dbo.SplitString(@List, @Delim) AS x
     WHERE (x.vn = 1) -- filter out duplicates
     ORDER BY CASE @Sort
       WHEN N'OriginalOrder' THEN CONVERT(INT, x.rn)
       WHEN N'Alphabetical'  THEN CONVERT(NVARCHAR(4000), x.value)
       ELSE CONVERT(SQL_VARIANT, NULL) END
     FOR XML PATH, TYPE).value(N'.[1]',N'nvarchar(max)'),1,1,N'')
  );
END

With these in place, removing duplicates from our original string and reassembling it in the desired order becomes quite trivial:

DECLARE @List NVARCHAR(MAX), @Delim NVARCHAR(32);

SELECT @List  = N'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta', 
       @Delim = N'/';

SELECT OriginalOrder = dbo.ReassembleString(@List, @Delim, N'OriginalOrder'),
       Alphabetical  = dbo.ReassembleString(@List, @Delim, N'Alphabetical');

Results:

OriginalOrder              Alphabetical
-----------------------    -----------------------
Bravo/Alpha/Tango/Delta    Alpha/Bravo/Delta/Tango

Now, this is not going to be very efficient at scale, but should be quite acceptable for smaller sets, or one-time mass updates - certainly fast compared to looping constructs you may be using now.

Next Steps


Last Update:


signup button

next tip button



About the author
MSSQLTips author Aaron Bertrand Aaron Bertrand is a Senior Consultant at SQL Sentry, Inc., and has been contributing to the community for about two decades, first earning the Microsoft MVP award in 1997.

View all my tips





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Friday, January 22, 2016 - 8:49:51 AM - Aaron Bertrand Back To Top

 Steve, I posted some performance comparisons here:

http://sqlperformance.com/2016/01/t-sql-queries/comparing-splitting-concat

 


Wednesday, January 20, 2016 - 3:50:38 PM - Aaron Bertrand Back To Top

 @Steve Lots of utility in a numbers table. Some examples here:

http://dba.stackexchange.com/q/11506/1186

And even if this specific functionality is the only one you need it for, if it performs better, I still don't consider that over-engineering. It's one table...

 


Wednesday, January 20, 2016 - 3:28:40 PM - Steve Mangiameli Back To Top

 

My intent was to show that it could be done more simply without the need of the numbers table.  That is the part I deemed over engineered.  But then, if you have one hanging around for various other tasks, then I can just shut my mouth!  I went ahead and modified the Parse function to return a positional value as well and put the rest of the code in a function (using table variables) to mimic what you are doing.  I added the ability to order as received or by alphanumeric.  I still think this is a little easier to follow, though the looping may cause issues if the string became egregiously long.  That said, your way provides a lot more utility if you have need of the numbers table in other places.

New Function with Positional ID

USE master

GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE type in ('FN', 'IF', 'TF', 'FS', 'FT') AND name = 'gfn_ParseList')

  EXEC sp_executesql N'CREATE FUNCTION gfn_ParseList RETURNS @paresedIDs AS BEGIN SELECT 1 ParsedValue, 1 PositionID RETURN END'

GO

 

ALTER FUNCTION gfn_ParseList (@strToPars VARCHAR(8000), @parseChar CHAR(1))

RETURNS @parsedIDs TABLE

   (ParsedValue VARCHAR(255), PositionID INT IDENTITY)

AS

BEGIN

DECLARE 

  @startPos INT = 0

  , @strLen INT = 0

 

WHILE LEN(@strToPars) >= @startPos

  BEGIN

    

    IF (SELECT CHARINDEX(@parseChar,@strToPars,(@startPos+1))) > @startPos

      SELECT @strLen  = CHARINDEX(@parseChar,@strToPars,(@startPos+1))  - @startPos

    ELSE

      BEGIN

        SET @strLen = LEN(@strToPars) - (@startPos -1)

        

        INSERT @parsedIDs

        SELECT RTRIM(LTRIM(SUBSTRING(@strToPars,@startPos, @strLen)))

        BREAK

      END

      

    SELECT @strLen  = CHARINDEX(@parseChar,@strToPars,(@startPos+1))  - @startPos

    

    INSERT @parsedIDs

    SELECT RTRIM(LTRIM(SUBSTRING(@strToPars,@startPos, @strLen)))

    

    SET @startPos = @startPos+@strLen+1

  END

RETURN

END  

GO

New Dedupe function, using parse function above

USE TEST

GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE type in ('FN', 'IF', 'TF', 'FS', 'FT') AND name = 'ufn_DedupeString')

  EXEC sp_executesql N'CREATE FUNCTION ufn_DedupeString() RETURNS NVARCHAR(MAX) AS BEGIN RETURN (SELECT CAST(GETDATE() AS NVARCHAR(100))); END';

GO

 

ALTER FUNCTION ufn_DedupeString (@dupeStr VARCHAR(MAX), @strDelimiter CHAR(1), @maintainOrder BIT)

RETURNS NVARCHAR(MAX)

AS

BEGIN  

  DECLARE @tblStr2Tbl  TABLE (ParsedValue VARCHAR(255), PositionID INT);

  DECLARE @tblDeDupeMe TABLE (ParsedValue VARCHAR(255), PositionID INT);

   

  INSERT @tblStr2Tbl

  SELECT DISTINCT ParsedValue, PositionID FROM master.dbo.gfn_ParseList(@dupeStr,@strDelimiter);  

  WITH cteUniqueValues

  AS

  (

    SELECT DISTINCT ParsedValue

    FROM @tblStr2Tbl

  )

  INSERT @tblDeDupeMe

  SELECT d.ParsedValue

    , CASE @maintainOrder

        WHEN 1 THEN MIN(d.PositionID)

      ELSE ROW_NUMBER() OVER (ORDER BY d.ParsedValue)

    END AS PositionID

  FROM cteUniqueValues u

    JOIN @tblStr2Tbl d ON d.ParsedValue=u.ParsedValue

  GROUP BY d.ParsedValue

  ORDER BY d.ParsedValue

  

  DECLARE 

    @valCount INT

  , @curValue VARCHAR(255) =''

  , @posValue INT=0

  , @dedupedStr VARCHAR(4000)=''; 

  

  SELECT @valCount = COUNT(1) FROM @tblDeDupeMe;

  

  WHILE @valCount > 0

  BEGIN

    SELECT @posValue=a.minPos, @curValue=d.ParsedValue

    FROM (SELECT MIN(PositionID) minPos FROM @tblDeDupeMe WHERE PositionID  > @posValue) a

      JOIN @tblDeDupeMe d ON d.PositionID=a.minPos;

  

    SET @dedupedStr+=@curValue;

    SET @valCount-=1;

    

    IF @valCount > 0

      SET @dedupedStr+='/';

  END

 

  RETURN @dedupedStr;

 

END

GO

 

Function Calls with long repetitive string

USE TEST

GO

 

DECLARE 

  @dupeStr VARCHAR(4000) = 'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Papa/Delta/Charlie/Bravo/Delta/Echo/Charlie/Juliet/Charlie/Foxtrot/Foxtrot/Yankee/Juliet/Golf/Alpha/Foxtrot/Golf/India/Delta/Zulu/Quebec/Alpha/Bravo/Tango/Hotel/Victor/Alpha/Delta/Charlie/Hotel/Delta/Echo/Charlie/Echo/Charlie/Foxtrot/Foxtrot/Alpha/Bravo/Golf/Alpha/Foxtrot/Golf/India/Delta/Zulu/Kilo/Alpha/Bravo/Tango/Delta/Romeo/Alpha/Delta/Uniform/Sierra/Delta/Whisky/Charlie/Papa/Charlie/Foxtrot/Foxtrot/Alpha/Xray/Golf/Victor/Foxtrot/Golf/India/Delta/Kilo/Bravo/Lima/Bravo/Tango/Delta/Bravo/Alpha/Delta/Papa/Lima/Delta/Quebec/Romeo/Kilo/Charlie/Foxtrot/Foxtrot/Alpha/Tango/Golf/Alpha/Lima/Golf/India/Victor/Zulu/Bravo/Alpha/Mike/Tango/Delta/Bravo/Alpha/Delta/Yankee/Bravo/Delta/Echo/Whisky/Echo/Charlie/Mike/Foxtrot/Alpha/Bravo/Golf/Alpha/Foxtrot/Golf/Xray/Delta/Mike/Bravo/Alpha/Bravo/November/Delta/Bravo/Alpha/November/Charlie/Bravo/Delta/Echo/Xray/Echo/November/Foxtrot/Foxtrot/Alpha/Bravo/Golf/Alpha/Foxtrot/Golf/India/Delta/Zulu/Quebec/Alpha/Xray/Tango/Oscar/Bravo/Alpha/Oscar/Oscar/Bravo/Delta/Echo/Charlie/Echo/Charlie/Foxtrot/Foxtrot/Alpha/Bravo/Golf/Alpha/Foxtrot/Golf/India/Delta/Zulu/Bravo/Alpha/Romeo/Yankee/Delta/Whisky/Alpha/Delta/Victor/Bravo/Delta/Tango/Charlie/Uniform/Charlie/Foxtrot/Foxtrot/Quebec/Bravo/Romeo/Alpha/Foxtrot/Golf/India/Delta/Zulu/Bravo/Alpha/Xray/Tango/Delta/Bravo/Alpha/Delta/Quebec/Bravo/Delta/Tango/Charlie/Echo/Charlie/Foxtrot/Tango/Quebec/Bravo/Romeo/Alpha/Foxtrot/Golf/India/Delta/Zulu/Bravo/Alpha/Romeo/Tango/Delta/Bravo/Uniform/Delta/Quebec/Bravo/Delta/Tango/Charlie/Echo/Charlie/Whisky/Foxtrot/Quebec/Bravo/Romeo/Alpha/Foxtrot/Golf/India/Sierra/Yankee/Bravo/Alpha/Romeo/Tango/Delta/Bravo/Alpha/Delta/Quebec/Bravo/Delta/Sierra/Charlie/Echo/Charlie/Foxtrot/Foxtrot/Quebec/Juliet/Romeo/Alpha/Foxtrot/Yankee/India/Delta/Zulu/Uniform/Alpha/Romeo/Tango/Delta/Bravo/Alpha/Delta/Quebec/Bravo/Xray/Tango/Charlie/Echo/Victor/Foxtrot/Victor/Quebec/Yankee/Romeo/Alpha/Tango/Yankee/India/Delta/Whisky/'

, @strDelimiter CHAR(1) = '/'

--, @maintainOrder BIT = 1; 

 

SELECT OriginalOrder  =dbo.ufn_DedupeString(@dupeStr, @strDelimiter, 1),

       AlphabeticOrder=dbo.ufn_DedupeString(@dupeStr, @strDelimiter, 0);

 


Wednesday, January 20, 2016 - 11:19:22 AM - Aaron Bertrand Back To Top

 Steve, also, half of my engineering is the reassemble function, which makes it very easy to pass in the delimited string and get a delimited string back, instead of all the code the user will have to write each time with your version. I thought about wrapping your code in a function, but because you have DDL with the #temp table, it can't be done. Maybe I'll try with a table variable instead of a #temp table.

 


Wednesday, January 20, 2016 - 11:09:03 AM - Aaron Bertrand Back To Top

Steve, I think you will find that it is over-engineered because it will prove to give better performance than a looping/iteration function; I also missed how yours guarantees that the returned string will be generated in the same order (or how a user could call the function and ask for alphabetical instead).


Tuesday, January 19, 2016 - 5:26:52 PM - Steve Mangiameli Back To Top

It seems to me this is over-engineered.  Parse the list into a table of distinct values and put it back together again via loop.

 

Parsing function

/********************************************************************************

PROCEDURE NAME   : gfn_ParseList

AUTHOR           : 

CREATED          : 02/14/2012

DESCRIPTION      : Takes a delineated list and parses it out.  Accepts a string and a delineator; assumes comma (',')

CALLED BY        : 

DEPENDS ON       : 

EXAMPLE          : gfn_ParseList (@assesmentIDs, ',')

 

DATE         NAME DESCRIPTION

02/14/2012 steve.mangiameli     Creation

********************************************************************************/

CREATE FUNCTION gfn_ParseList (@strToPars VARCHAR(8000), @parseChar CHAR(1))

RETURNS @parsedIDs TABLE

(ParsedValue VARCHAR(255))

AS

BEGIN

 

DECLARE 

@startPos INT = 0

, @strLen INT = 0

 

WHILE LEN(@strToPars) >= @startPos

BEGIN

 

IF (SELECT CHARINDEX(@parseChar,@strToPars,(@startPos+1))) > @startPos

SELECT @strLen  = CHARINDEX(@parseChar,@strToPars,(@startPos+1))- @startPos

ELSE

BEGIN

SET @strLen = LEN(@strToPars) - (@startPos -1)

 

INSERT @parsedIDs

SELECT RTRIM(LTRIM(SUBSTRING(@strToPars,@startPos, @strLen)))

BREAK

END

 

SELECT @strLen  = CHARINDEX(@parseChar,@strToPars,(@startPos+1))- @startPos

 

INSERT @parsedIDs

SELECT RTRIM(LTRIM(SUBSTRING(@strToPars,@startPos, @strLen)))

 

SET @startPos = @startPos+@strLen+1

END

RETURN

END

  

Loop

DECLARE 

 @dupeStr VARCHAR(4000) = 'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'

, @strDelimiter CHAR(1) = '/'

 

IF OBJECT_ID('tempdb..#DeDupeMe','U') IS NOT NULL

DROP TABLE #DeDupeMe

 

SELECT DISTINCT ParsedValue 

INTO #DeDupeMe

FROM gfn_ParseList(@dupeStr,@strDelimiter)

 

DECLARE 

 @valCount INT

, @curValue VARCHAR(255) =''

, @dedupedStr VARCHAR(4000)=''

 

SELECT @valCount = COUNT(1) FROM #DeDupeMe

 

WHILE @valCount > 0

BEGIN

SELECT @curValue=MIN(ParsedValue)

FROM #DeDupeMe

WHERE ParsedValue > @curValue

 

SET @dedupedStr+=@curValue

SET @valCount-=1

 

IF @valCount > 0

SET @dedupedStr+='/'

END

 

SELECT @dedupedStr DeDupe_String

 


Monday, January 18, 2016 - 9:11:47 AM - J Back To Top

 It's splitting hairs, but page compression doesn't buy you anything in this case over row compression. I believe SQL Server is smart enough to not bother trying, but I'm not 100% sure because, rather than rely on it, I'd specify row compression.

 


Thursday, January 14, 2016 - 9:33:46 AM - Aaron Bertrand Back To Top

 

Paul, yes, you're right. Most of my code library has a hard-coded delimiter or only accepts char(1) because, in the real world, how often are we using multi-character delimiters? :-) But point taken, I'll try to get the script updated.


Thursday, January 14, 2016 - 6:30:16 AM - Paul Back To Top

Great code! Thanks.

One tiny error: The last '1' in your code should be len(@delim), so that it works with delimiters of other length than 1 too.

     FOR XML PATH, TYPE).value(N'.[1]',N'nvarchar(max)'),1,len(@delim),N'') 

 


Learn more about SQL Server tools