mssqltips logo

SQL Server STRING SPLIT Limitations

By:   |   Updated: 2019-09-03   |   Comments (3)   |   Related: More > Functions - User Defined UDF

Problem

I’ve written here about two known limitations of the STRING_SPLIT function, added in SQL Server 2016.

The first issue is that the output order is not guaranteed to match the input order (so a string bob,mack,foo may get split apart and come out of the function in the order mack, foo, bob); I suggested some ways to work around that in Solve old problems with SQL Server’s new STRING_AGG and STRING_SPLIT functions.

The second is that the separator you split by is limited to a single character; I addressed this in Dealing with the single-character delimiter in SQL Server's STRING_SPLIT function.

A third limitation that I recently discovered is that not all single characters are valid input to the separator argument of the function. This can be important to know if you are struggling to find a delimiter that can’t possibly exist inside the source data you’re concatenating and splitting.

Solution

Before we talk about any kind of workaround, let’s start with how I got here, since I stumbled upon this issue accidentally. I was actually reworking a split function I wrote, long before STRING_SPLIT existed, so that it could now handle a space as the separator. So I re-wrote the function as follows, changing LEN() to DATALENGTH()/2 so that the “trailing” space would be ignored:

CREATE FUNCTION dbo.SplitString
(
@List nvarchar(max),
@Delim nvarchar(255)
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
(
WITH n(n) AS (SELECT 1 UNION ALL SELECT n+1
FROM n WHERE n <= LEN(@List)
)
SELECT [Value] = SUBSTRING(@List, n, CHARINDEX(@Delim, @List + @Delim, n) - n)
FROM n
WHERE n <= LEN(@List)
AND SUBSTRING(@Delim + @List, n, DATALENGTH(@Delim)/2) = @Delim
);

I tested the function, to make sure spaces now work, and that all the other characters I had tested before continue to work:

SELECT [value] FROM dbo.SplitString(N'mack,foo,bob', N',');
SELECT [value] FROM dbo.SplitString(N'mack|foo|bob', N'|');
SELECT [value] FROM dbo.SplitString(N'mack~foo~bob', N'~');
SELECT [value] FROM dbo.SplitString(N'mack foo bob', N' ');			

These all worked correctly, as I suspected. Then I thought, let’s try some crazy characters from the upper bounds of the character set, to make sure that other delimiters would also be supported (screenshot because I know a web page won’t support all of these characters):

Sample code testing different Unicode characters

As indicated in the comments, some worked, but some failed. When they failed, it was always with the following error message:

Msg 537, Level 16, State 3
Invalid length parameter passed to the LEFT or SUBSTRING function.

So I tried reverse engineering the function, and determined that even though these troublesome characters had the same datalength and similar varbinary conversion to those that worked, they were failing to get parsed correctly by CHARINDEX (the function would return the wrong location within the string). I thought, that is interesting, and immediately wondered two things: (1) how could I compile a list of all the troublesome characters, and (2) does the same problem affect the built-in function, STRING_SPLIT?

I quickly assembled a test to tackle both simultaneously. First, I checked the behavior of STRING_SPLIT with one of the troublesome characters (again, using a screenshot to avoid translation issues):

Showing an unexpected result from STRING_SPLIT()

The real test would be checking all of the characters (well, from NCHAR(0) through NCHAR(65536), anyway):

SET NOCOUNT ON;
CREATE TABLE #stuff
(
i int PRIMARY KEY,
c nchar(1),
result int,
details nvarchar(2048)
); DECLARE @sep nchar(1), @list nvarchar(max), @i int = 0; WHILE @i <= 65536
BEGIN SET @sep = NCHAR(@i); SET @list = N'mack' + @sep + N'foo' + @sep + N'bob'; BEGIN TRY INSERT #stuff(i,c,result) SELECT @i, NCHAR(@i),
CASE (SELECT COUNT(*) FROM STRING_SPLIT(@list,@sep))
WHEN 3 THEN 1 -- success! 3 is expected
WHEN 1 THEN 0 -- failure! 1 means the string was not split
ELSE -1 -- neither 3 nor 1! Something else happened
END; END TRY BEGIN CATCH INSERT #stuff(i,c,result,details) SELECT @i, NCHAR(@i),
-2, -- a real exception, so let's find out why:
ERROR_MESSAGE(); END CATCH SET @i += 1; END

First thing I wanted to inspect: how many characters worked?

SELECT result, breakdown = COUNT(*) 
FROM #stuff
GROUP BY result
ORDER BY result DESC;

Results:

Breakdown of results from 65,537 character tests

The good news? 43,927 characters worked! The bad news?  21,535 characters did not work. (Note that your result may differ if you are not using the default instance collation from SQL Server setup.) In addition,  29 had some other issue and 1 caused an error in the script.

Let’s look at the outright failure first:

SELECT i, c, result, details FROM #stuff WHERE result = -2;			

Results:

Error result from NCHAR(65536)

I’m okay with this. I was pushing the upper bound anyway, and should have realized that NCHAR(65536) would yield NULL.

Next up, the ones that didn’t yield the correct count or a single value:

SELECT i, c FROM #stuff WHERE result = -1;			

Results:

Characters which resulted in too many strings

Of course, these are the regular alphanumeric characters (or their codepage equivalents in some way) that appear in the data, so of course if you’re splitting on a separator that is also in the data, you’re going to end up with more segments than you would have expected.

And finally, the ones that didn’t split at all (WHERE result = 0), were a whole mix of things. NCHAR(0) was no surprise, same with all the characters that yield empty boxes or question marks. But there are others I can’t explain, like NCHAR(173) (-), NCHAR(706) (), and NCHAR(1421) (which is some kind of pinwheel character). Ranges that are safe to use: 506 – 535, 1377 – 1414, and 63744 - 64045.

I talked to some other folks about this, and Erland suggested trying a different (newer) collation, since more code points are defined there. So I changed the code to the following:

SET NOCOUNT ON;
CREATE TABLE #stuff2
(
i int PRIMARY KEY,
c nchar(1),
result int,
details nvarchar(2048)
); DECLARE @sep nchar(1), @list nvarchar(max), @i int = 0; WHILE @i <= 65535
BEGIN SET @sep = NCHAR(@i); SET @list = N'mack' + @sep + N'foo' + @sep + N'bob'; BEGIN TRY INSERT #stuff2(i,c,result) SELECT @i, NCHAR(@i),
CASE (SELECT COUNT(*) FROM STRING_SPLIT(@list COLLATE Latin1_General_100_CI_AI,@sep))
WHEN 3 THEN 1 -- success! 3 is expected
WHEN 1 THEN 0 -- failure! 1 means the string was not split
ELSE -1 -- neither 3 nor 1! Something else happened
END; END TRY BEGIN CATCH INSERT #stuff2(i,c,result,details) SELECT @i, NCHAR(@i),
-2, -- a real exception, so let's find out why:
ERROR_MESSAGE(); END CATCH SET @i += 1; END

This was better, but not perfect. We still have over 6,300 characters that can’t be reliably used as a separator to split at all, and 300 characters that now yield more than 3 results in their output:

Breakdown of results from 65,536 character tests

Summary

The short version of this cautionary tale: don’t pick random separators; do your research, and make sure you pick a separator that will both behave as you expect and cannot possibly appear in the data. This holds true whether you are using the built-in STRING_SPLIT (with or without COLLATE) or, as I have also demonstrated, a custom T-SQL split function. And this isn’t a solution, of course; it’s just my way of dealing with the problem – identifying characters that can or can’t be used as delimiters – without obtaining a PhD in Unicode, collations, and code pages to help me understand why.

Next Steps

Read on for related tips and other resources:



Last Updated: 2019-09-03


get scripts

next tip button



About the author
MSSQLTips author Aaron Bertrand Aaron Bertrand (@AaronBertrand) is a Product Manager at SentryOne, with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and serves as a community moderator for the Database Administrators Stack Exchange.

View all my tips




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name
*Email
Email me updates

Signup for our newsletter

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.





Friday, September 06, 2019 - 4:05:21 PM - Brian Gale Back To Top

Just for curiosity sake, I decided to look into this a little bit and found something potentially interesting.

When using the last script you had, where you have 300 results of "neither 3 or 1", I broke this down a bit more to show how many results were found.  There are 176 cases where there are 4 results, 28 cases where there are 5 and 96 cases where there are 6.  Looking at JUST the cases where there are 5, the 5 results are:
mack
foo
<blank>
o
<blank>

So it looks like the character is being interpreted as a "b".  Looking at the characters that have this, they all are B shaped characters so I do kind of understand why you get 5 results, but at the same time find it odd.  Looking at the ones with 6, it is now a character with the O shape so we are getting Mack, f, blank, blank, b and b (not sure why there is 2 blanks after f, but no blank after the first b, but that is what it shows).  The ones that bring back 4 results are ones that have only 1 character being pulled out in the input string which is expected based on the previous 2.
Looking at the pattern (characters that LOOK like the letters in the source string) I can see that it is treating them like a separator, I just don't understand why it is doing this.

Some of them make sense (like NCHAR(75) which is a captial K and you are in case insensitive collation), but others I am not certain on.  Switching to a case sensitive collation gives you more successes (59064), same number of failures (6310) and a fewer number of "something else happened" (162).

But your script is pretty awesome for anyone looking to use the STRING_SPLIT function to split up a string to ensure that you are not going to break or get incorrect results by picking the wrong value for the separator!


Wednesday, September 04, 2019 - 8:37:08 AM - Aaron Bertrand Back To Top

@Vasyl the error only occurs when the delimiter used changes the actual meaning of Len/datalength, so I don’t believe it’s a generic error involving the number of elements.

Also, yes, recursion is a problem for strings longer than 99 characters. I wasn’t providing a full string splitting solution here (there are dozens of articles that already cover that), I was limiting to strings only long enough to demonstrate the problem.


Wednesday, September 04, 2019 - 4:16:34 AM - Vasyl Pavuk Back To Top

Dear Aaron.

You described that sometime you have errors like the following:

Invalid length parameter passed to the LEFT or SUBSTRING function.

Thats probably happend because recursive CTE named as "n" returns len(@List) + 1 values.

You may have second possible source of problems - number of recursive iterations, because max of recursion iterations is 100 by default. So, you may have problems for strings with length over 99.

Kind regards,

Vasyl Pavuk.



download

























get free sql tips

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



Learn more about SQL Server tools