The SQL Server Numbers Table, Explained - Part 2



Problem

In Part 1 of this tip, I showed how a Numbers table works, and demonstrated some examples including generating time series and sets of dates. I saved a couple of more advanced examples for this second segment.

Solution

Just to recap from Part 1, this tip will explore solutions to a couple of common problems using the following Numbers table:

DECLARE @UpperBound INT = 1000000;

;WITH cteN(Number) AS
(
  SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id]) - 1
  FROM sys.all_columns AS s1
  CROSS JOIN sys.all_columns AS s2
)
SELECT [Number] INTO dbo.Numbers
FROM cteN WHERE [Number] <= @UpperBound;

CREATE UNIQUE CLUSTERED INDEX CIX_Number ON dbo.Numbers([Number])
WITH 
(
  FILLFACTOR = 100,      -- in the event server default has been changed
  DATA_COMPRESSION = ROW -- if Enterprise & table large enough to matter
);

Identifying Gaps

The first example in Part 1 showed how to generate a new set of sequential data. Something else that comes up from time to time is the need to find gaps in existing data. Let's say you have a table with an IDENTITY column, and you need to quickly see which IDENTITY values are missing (which could be due to rollbacks, deletes, or this "feature"):

CREATE TABLE dbo.SampleOrders
(
  OrderID INT IDENTITY(1,1) PRIMARY KEY
);

INSERT dbo.SampleOrders DEFAULT VALUES; -- 1
INSERT dbo.SampleOrders DEFAULT VALUES; -- 2

BEGIN TRANSACTION;
INSERT dbo.SampleOrders DEFAULT VALUES; -- 3, will be rolled back
INSERT dbo.SampleOrders DEFAULT VALUES; -- 4, will be rolled back
ROLLBACK TRANSACTION;

INSERT dbo.SampleOrders DEFAULT VALUES; -- 5
INSERT dbo.SampleOrders DEFAULT VALUES; -- 6, will be deleted
DELETE dbo.SampleOrders WHERE OrderID = 6;
INSERT dbo.SampleOrders DEFAULT VALUES; -- 7

After running this, if we just select from the table, we see:

1
2
5
7

What we want to identify in our query, though, are the values that are missing (3, 4, and 6). Since we know we have all of these numbers in our Numbers table, we can simply perform an OUTER JOIN, limiting the scan against the numbers table to the maximum value we found in our orders table:

DECLARE @UpperBound INT = (SELECT MAX(OrderID) FROM dbo.SampleOrders);
-- you may want a @LowerBound with MIN() as well

SELECT MissingOrderID = n.Number
  FROM dbo.Numbers AS n
  LEFT OUTER JOIN dbo.SampleOrders AS s
  ON n.Number = s.OrderID
  WHERE s.OrderID IS NULL
    AND n.Number >= 1 -- or > @LowerBound
    AND n.Number < @UpperBound;

This returns all of the values in the Numbers table that weren't found in the orders table (not including future values):

3
4
6

Since we also know that the values for the columns in both tables are unique by definition, we can use EXCEPT to get the same results with a slightly simpler plan (but of course always test in your own environment and against your own data):

DECLARE @UpperBound INT = (SELECT MAX(OrderID) FROM dbo.SampleOrders);
-- you may want a @LowerBound with MIN() as well

SELECT MissingOrderID = n.Number
  FROM dbo.Numbers AS n
  WHERE n.Number >= 1
    AND n.Number < @UpperBound
EXCEPT
SELECT OrderID FROM dbo.SampleOrders;

EXCEPT has an implicit distinct, as I explain in this post about anti-semi-joins, so do be careful about where you use it, and validate that your output is still what you expect.
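To illustrate that implicit distinct with a tiny, standalone example (not tied to the tables above): even when a value appears multiple times on the left side, EXCEPT emits it only once:

```sql
-- EXCEPT de-duplicates: 1 appears twice on the left, but is returned once
SELECT x FROM (VALUES (1),(1),(2)) AS t(x)
EXCEPT
SELECT 2;
-- result: a single row containing 1
```

If your downstream logic depends on duplicates surviving, the LEFT OUTER JOIN form is the safer choice.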

To identify gaps in a set of data, your Numbers table would have to have enough rows to cover both all of the rows selected from the source table *and* the number of missing values.
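One way to guard against an undersized table is a quick sanity check before running the gap query; this is a sketch using the dbo.Numbers and dbo.SampleOrders tables defined above:

```sql
DECLARE @MaxNumber INT = (SELECT MAX(Number)  FROM dbo.Numbers),
        @MaxOrder  INT = (SELECT MAX(OrderID) FROM dbo.SampleOrders);

-- if OrderID can exceed the largest Number, gaps near the top of the
-- range would be silently missed by the join above
IF @MaxOrder > @MaxNumber
  RAISERROR('dbo.Numbers is too small to cover the OrderID range.', 11, 1);
```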

Filling Gaps

Another scenario that crops up a lot is when a report yields gaps. You are showing the sales for all of the days in the month, or all of the months in the year, and perhaps at some of those points you had no sales. Simplifying a little bit, let's say you have the following table and sample data, and want a report showing total sales by day for February 1st through February 3rd:

CREATE TABLE dbo.SampleSales
(
  OrderDate DATETIME,
  Total DECIMAL(11,2)
);

-- sample data only has sales for Feb 1 and Feb 3:
INSERT dbo.SampleSales(OrderDate, Total) VALUES
('20160201 03:45', 40),('20160201 13:44', 10),
('20160203 04:52', 32),('20160203 06:55', 80);

The typical first attempt at such a report will look something like this:

DECLARE @StartDate DATE = '20160201', @EndDate DATE = '20160203';

SELECT 
  OrderDate = CONVERT(DATE, OrderDate), 
  TotalSales = SUM(Total)
FROM dbo.SampleSales
WHERE OrderDate >= @StartDate
  AND OrderDate < DATEADD(DAY, 1, @EndDate)
GROUP BY CONVERT(DATE, OrderDate)
ORDER BY OrderDate;

However, the output of this report looks like this:

OrderDate     TotalSales
----------    ----------
2016-02-01         50.00
2016-02-03        112.00

In other words, there is no row for February 2nd. In such a simplistic case, we can mentally fill in the gap of course, but extrapolate this to longer date ranges, or cases where you're grouping by salesperson, or region, or product - you want those rows with zero sales to show up in the report.

Here is one way (and I can assure you there are many) to fill in those missing dates. We grab all of the days in the desired range by adding days to the start date from our Numbers table, and then perform an outer join against the sales table. Before we get to the actual results (grouped by date), let's take an initial look at how just the join looks:

DECLARE @StartDate DATE = '20160201', @EndDate DATE = '20160203';

SELECT
  n.Number,
  LeftRange  = DATEADD(DAY, n.Number,     @StartDate),
  RightRange = DATEADD(DAY, n.Number + 1, @StartDate),
  s.OrderDate,
  s.Total
FROM dbo.Numbers AS n
LEFT OUTER JOIN dbo.SampleSales AS s
ON s.OrderDate >= DATEADD(DAY, n.Number,     @StartDate)
AND s.OrderDate < DATEADD(DAY, n.Number + 1, @StartDate)
WHERE n.Number <= DATEDIFF(DAY, @StartDate, @EndDate)
ORDER BY n.Number;

The output shows the value from the Numbers table, the beginning of the date range we need to join on, the next day (which serves as the end of the range we need to join on), and then the order date and total from the sales table that match the join conditions. As you can see, we get 2 rows for each of the days with sales, and a single row for the date that is missing from the sales table (and in that row, everything coming from the sales table is NULL):

Number    LeftRange     RightRange    OrderDate                  Total
------    ----------    ----------    -----------------------    -----
0         2016-02-01    2016-02-02    2016-02-01 03:45:00.000    40.00
0         2016-02-01    2016-02-02    2016-02-01 13:44:00.000    10.00
1         2016-02-02    2016-02-03    NULL                       NULL
2         2016-02-03    2016-02-04    2016-02-03 04:52:00.000    32.00
2         2016-02-03    2016-02-04    2016-02-03 06:55:00.000    80.00

So to get the report we really want, we simply need to adjust this query by removing some of the output columns, then grouping by the beginning of each date range and summing the sales (using COALESCE in order to show 0 instead of NULL):

DECLARE @StartDate DATE = '20160201', @EndDate DATE = '20160203';

SELECT 
  OrderDate = DATEADD(DAY, n.Number, @StartDate),
  TotalSales = COALESCE(SUM(s.Total), 0)
FROM dbo.Numbers AS n
LEFT OUTER JOIN dbo.SampleSales AS s
ON s.OrderDate >= DATEADD(DAY, n.Number, @StartDate)
AND s.OrderDate < DATEADD(DAY, n.Number + 1, @StartDate)
WHERE n.Number <= DATEDIFF(DAY, @StartDate, @EndDate)
GROUP BY n.Number
ORDER BY n.Number;

Now the results are the ones we wanted, with a row for February 2nd:

OrderDate     TotalSales
----------    ----------
2016-02-01         50.00
2016-02-02          0.00
2016-02-03        112.00

This does not yield a better-looking plan in this contrived case, but I think it will beat many of the alternatives you might use to fill in that missing row (for example, a loop that steps through all of the dates in the sales table, stores the ones it doesn't find in a temporary table, and then unions that with the grouping query above).

As with other problems, for this task your Numbers table would need to be large enough to cover every distinct row possible in the output (including the missing rows).
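The same pattern extends to the multi-dimension reports mentioned earlier (sales by day *and* salesperson, region, product, and so on): cross join the generated dates with the dimension table, then outer join the sales so every combination appears even with zero sales. This sketch assumes a hypothetical dbo.SalesPeople table and a SalesPersonID column on dbo.SampleSales, neither of which exists in the sample schema above:

```sql
DECLARE @StartDate DATE = '20160201', @EndDate DATE = '20160203';

SELECT
  OrderDate  = DATEADD(DAY, n.Number, @StartDate),
  sp.SalesPersonID,                      -- hypothetical dimension
  TotalSales = COALESCE(SUM(s.Total), 0)
FROM dbo.Numbers AS n
CROSS JOIN dbo.SalesPeople AS sp         -- one row per (day, salesperson)
LEFT OUTER JOIN dbo.SampleSales AS s
  ON  s.SalesPersonID = sp.SalesPersonID -- assumed column
  AND s.OrderDate >= DATEADD(DAY, n.Number,     @StartDate)
  AND s.OrderDate <  DATEADD(DAY, n.Number + 1, @StartDate)
WHERE n.Number <= DATEDIFF(DAY, @StartDate, @EndDate)
GROUP BY n.Number, sp.SalesPersonID
ORDER BY n.Number, sp.SalesPersonID;
```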

Splitting Strings

By far the most common use I've had for a Numbers table is to split strings. While an atrocity like a comma-separated string flies in the face of relational databases, and many would argue it has no place inside of SQL Server, the reality is that this is something we're just going to have to get used to dealing with. So, I'm going to construct a very simple example; we have the following string:

DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey';

Regardless of where this list came from or what the output will be used for, assume that we have been charged with producing a set from this using T-SQL. The output should look like this:

Name
-----
Mark
Sindy
Rob
Casey

Over the years, I have found that most people approach this problem with a very programmatic, iterative mindset: a loop that walks the entire string, one character at a time, building up the next value until it reaches the next delimiter. I have seen this solution implemented time and time again, and it is almost always a source of performance issues:

DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ',';

DECLARE @chunks TABLE(chunk VARCHAR(4000));

DECLARE @idx SMALLINT = 1, 
        @char CHAR(1), 
        @chunk VARCHAR(4000) = '';

WHILE @idx <= LEN(@List)
BEGIN
  SET @char = SUBSTRING(@List, @idx, 1);
  IF @char = @Delimiter
  BEGIN
    IF @chunk > ''
      INSERT @chunks(chunk) SELECT @chunk;
    SET @chunk = '';
  END
  ELSE
  BEGIN
    SET @chunk += @char;
  END
  SET @idx += 1;
END
IF @chunk > ''
  INSERT @chunks(chunk) SELECT @chunk;
  
SELECT Name = chunk FROM @chunks;

Another approach I have seen is a little better, processing whole elements at a time instead of building them character by character, but it is still sub-optimal:

DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ',';

DECLARE @chunks TABLE(chunk VARCHAR(4000));

DECLARE @idx SMALLINT,
        @pos SMALLINT;

WHILE LEN(@List) > 0
BEGIN
    SET @idx = CHARINDEX(@Delimiter, @List);

    IF @idx = 0
    BEGIN
      INSERT @chunks(chunk) SELECT @List;
      BREAK;
    END
    ELSE
    BEGIN
      INSERT @chunks(chunk) SELECT SUBSTRING(@List, 1,@idx - 1);
      SET @pos = @idx + 1;
      SET @List = SUBSTRING(@List, @pos , LEN(@List) - @pos + 1);
    END
END

SELECT Name = chunk FROM @chunks;

I'm not going to explain these routines in any depth because they are fairly straightforward to follow and, performance aside, they get the desired results. I'm sure there are dozens of other approaches that follow the same type of loop-based thinking, with slight variations in implementation; also, beware of any clever-looking XML-based splitting mechanisms, as they tend to carry a lot of additional overhead, even though there might not be any loop-based logic visible (evidence here).
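For reference, the XML-based mechanism alluded to above typically looks something like this sketch. It avoids a visible loop, but it still pays conversion and node-shredding overhead, and it needs extra care for characters that are special in XML (such as & or <):

```sql
DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey';

-- turn the list into <i>Mark</i><i>Sindy</i>... and shred the nodes
DECLARE @x XML = CONVERT(XML, '<i>' + REPLACE(@List, ',', '</i><i>') + '</i>');

SELECT Name = i.value('(./text())[1]', 'VARCHAR(4000)')
FROM @x.nodes('/i') AS t(i);
```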

The approach I like, which uses a Numbers table, is rather concise, but I am going to expand it here so that I can explain how it works. Starting with the same list, we're going to join the string to the Numbers table, but only at the locations where an element begins (treating the string as if a delimiter also precedes the first element and follows the last). This means the join produces one row per element. Before extracting anything, let's first inspect those positions:

DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ',';

SELECT
  List = @List,
  BeginningOfElement = Number, 
  NextDelimiter = CHARINDEX(@Delimiter, @List + @Delimiter, Number)
FROM dbo.Numbers
WHERE Number <= LEN(@List)
AND CHARINDEX(@Delimiter, @Delimiter + @List, Number) = Number;

Inspecting the output, we can see that the join allows us to identify the location after each comma (and we prepend one using @Delimiter + @List), and the location of each subsequent comma (and we append one using @List + @Delimiter):

List                    BeginningOfElement    NextDelimiter
--------------------    ------------------    -------------
Mark,Sindy,Rob,Casey    1                     5
Mark,Sindy,Rob,Casey    6                     11
Mark,Sindy,Rob,Casey    12                    15
Mark,Sindy,Rob,Casey    16                    21

I exposed List in the output as well, so that it would be visually clear that, on each row, you could take the substring of List, from BeginningOfElement to NextDelimiter - 1, and this would produce the string in that location. To visualize further:

                     List                    BeginningOfElement    NextDelimiter
             --------------------            ------------------    -------------
             Mark,Sindy,Rob,Casey            1                     5
-- char 1 ---^    ^--- char 5
             Mark,Sindy,Rob,Casey            6                     11
     -- char 6 ---^    ^--- char 11
             Mark,Sindy,Rob,Casey            12                    15
          -- char 12 ---^  ^--- char 15
             Mark,Sindy,Rob,Casey            16                    21
              -- char 16 ---^    ^--- char 21

So, we can take the query above, stuff it into a subquery or CTE, and then extract the relevant parts of each string.

DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ',';

;WITH cteW AS
(
  SELECT
    List = @List,
    BeginningOfElement = Number, 
    NextDelimiter = CHARINDEX(@Delimiter, @List + @Delimiter, Number)
  FROM dbo.Numbers
  WHERE Number <= LEN(@List)
  AND CHARINDEX(@Delimiter, @Delimiter + @List, Number) = Number
)
SELECT Name = SUBSTRING(@List, BeginningOfElement, NextDelimiter - BeginningOfElement)
FROM cteW;

This could be made even more concise by removing the CTE and the friendly output column names, at a modest cost to readability:

DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ',';

SELECT Name = SUBSTRING(@List, Number, 
    CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)
  FROM dbo.Numbers
  WHERE Number <= LEN(@List)
  AND CHARINDEX(@Delimiter, @Delimiter + @List, Number) = Number;

For splitting strings, your Numbers table needs enough rows to cover the length of the longest supported string. This will commonly be limited to 4,000 characters (for Unicode) or 8,000 (for ASCII), but if you need to support MAX types, you'll have to decide whether you really want to allow strings of 1 or 2 billion characters - there is probably a practical limit somewhere that you can arrive at.

This is the foundation for a Numbers table-based string splitting mechanism, but ignores a lot of items for simplicity - for example, dealing with delimiters that have more than one character, using Unicode strings or MAX types, returning the results in a specific order, eliminating duplicates, and handling edge cases like empty elements, single-element variables, leading and trailing spaces, and so on. I may deal with these issues in a future tip, but to get more details on some of them in the meantime, see Removing Duplicates from Strings in SQL Server. (Though honestly, if you are splitting strings often, there are even better alternatives, such as much more powerful string handling in CLR, or using table-valued parameters to avoid splitting in the first place.)
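To make the table-valued parameter suggestion concrete, here is a minimal sketch; the dbo.NameList type and dbo.ProcessNames procedure are hypothetical names invented for illustration:

```sql
CREATE TYPE dbo.NameList AS TABLE (Name VARCHAR(4000) NOT NULL);
GO
CREATE PROCEDURE dbo.ProcessNames
  @Names dbo.NameList READONLY
AS
BEGIN
  SET NOCOUNT ON;
  SELECT Name FROM @Names; -- the set arrives already split
END
GO
-- the caller passes rows instead of a delimited string:
DECLARE @n dbo.NameList;
INSERT @n(Name) VALUES ('Mark'),('Sindy'),('Rob'),('Casey');
EXEC dbo.ProcessNames @Names = @n;
```

From application code, most client libraries can stream a collection straight into the table-valued parameter, so the delimited string never has to exist at all.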

Conclusion

In this pair of tips, I set out to explain how a Numbers table works, and to give a bit more detail about a few of the more common problems it can solve. A lot of people object to the storage and memory requirements of another table, but I hope I have shown that it can be a useful and worthwhile weapon in your arsenal.


About the author
Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog SQLPerformance.com, and also blogs at sqlblog.org.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Friday, February 26, 2016 - 2:06:35 PM - Shafiq

I like the parsing comma delimited string with number. What a mind boggling tip? Keep it up

 














