By: Aaron Bertrand | Last Updated: 2016-02-26 | Comments (1) | T-SQL
In Part 1 of this tip, I showed how a Numbers table works, and demonstrated some examples including generating time series and sets of dates. I saved a couple of more advanced examples for this second segment.
Just to recap from Part 1, this tip will explore solutions to a couple of common problems using the following Numbers table:
DECLARE @UpperBound INT = 1000000; ;WITH cteN(Number) AS ( SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id]) - 1 FROM sys.all_columns AS s1 CROSS JOIN sys.all_columns AS s2 ) SELECT [Number] INTO dbo.Numbers FROM cteN WHERE [Number] <= @UpperBound; CREATE UNIQUE CLUSTERED INDEX CIX_Number ON dbo.Numbers([Number]) WITH ( FILLFACTOR = 100, -- in the event server default has been changed DATA_COMPRESSION = ROW -- if Enterprise & table large enough to matter );
The first example in Part 1 showed how to generate a new set of sequential data. Something else that comes up from time to time is the need to find gaps in existing data. Let's say you have a table with an IDENTITY column, and you need to quickly see which IDENTITY values are missing (which could be due to rollbacks, deletes, or this "feature"):
CREATE TABLE dbo.SampleOrders ( OrderID INT IDENTITY(1,1) PRIMARY KEY ); INSERT dbo.SampleOrders DEFAULT VALUES; -- 1 INSERT dbo.SampleOrders DEFAULT VALUES; -- 2 BEGIN TRANSACTION; INSERT dbo.SampleOrders DEFAULT VALUES; -- 3, will be rolled back INSERT dbo.SampleOrders DEFAULT VALUES; -- 4, will be rolled back ROLLBACK TRANSACTION; INSERT dbo.SampleOrders DEFAULT VALUES; -- 5 INSERT dbo.SampleOrders DEFAULT VALUES; -- 6, will be deleted DELETE dbo.SampleOrders WHERE OrderID = 6; INSERT dbo.SampleOrders DEFAULT VALUES; -- 7
After running this, if we just select from the table, we see:
1 2 5 7
What we want to identify in our query, though, are the values that are missing (3, 4, and 6). Since we know we have all of these numbers in our Numbers table, we can simply perform an OUTER JOIN, limiting the scan against the numbers table to the maximum value we found in our orders table:
DECLARE @UpperBound INT = (SELECT MAX(OrderID) FROM dbo.SampleOrders); -- you may want a @LowerBound with MIN() as well SELECT MissingOrderID = n.Number FROM dbo.Numbers AS n LEFT OUTER JOIN dbo.SampleOrders AS s ON n.Number = s.OrderID WHERE s.OrderID IS NULL AND n.Number >= 1 -- or > @LowerBound AND n.Number < @UpperBound;
This returns all of the values in the Numbers table that weren't found in the orders table (not including future values):
3 4 6
Since we also know that the values for the columns in both tables are unique by definition, we can use EXCEPT to get the same results with a slightly simpler plan (but of course always test in your own environment and against your own data):
DECLARE @UpperBound INT = (SELECT MAX(OrderID) FROM dbo.SampleOrders); -- you may want a @LowerBound with MIN() as well SELECT MissingOrderID = n.Number FROM dbo.Numbers AS n WHERE n.Number >= 1 AND n.Number < @UpperBound EXCEPT SELECT OrderID FROM dbo.SampleOrders;
EXCEPT has an implicit distinct, as I explain in this post about anti-semi-joins, so do be careful about where you use it, and validate that your output is still what you expect.
To identify gaps in a set of data, your Numbers table would have to have enough rows to cover both all of the rows selected from the source table *and* the number of missing values.
Another scenario that crops up a lot is when a report yields gaps. You are showing the sales for all of the days in the month, or all of the months in the year, and perhaps at some of those points you had no sales. Simplifying a little bit, let's say you have the following table and sample data, and want a report showing total sales by day for February 1st through February 3rd:
CREATE TABLE dbo.SampleSales ( OrderDate DATETIME, Total DECIMAL(11,2) ); -- sample data only has sales for Feb 1 and Feb 3: INSERT dbo.SampleSales(OrderDate, Total) VALUES ('20160201 03:45', 40),('20160201 13:44', 10), ('20160203 04:52', 32),('20160203 06:55', 80);
The typical first attempt at such a report will look something like this:
DECLARE @StartDate DATE = '20160201', @EndDate DATE = '20160203'; SELECT OrderDate = CONVERT(DATE, OrderDate), TotalSales = SUM(Total) FROM dbo.SampleSales WHERE OrderDate >= @StartDate AND OrderDate < DATEADD(DAY, 1, @EndDate) GROUP BY CONVERT(DATE, OrderDate) ORDER BY OrderDate;
However, the output of this report looks like this:
OrderDate TotalSales ---------- ---------- 2016-02-01 50.00 2016-02-03 112.00
In other words, there is no row for February 2nd. In such a simplistic case, we can mentally fill in the gap of course, but extrapolate this to longer date ranges, or cases where you're grouping by salesperson, or region, or product - you want those rows with zero sales to show up in the report.
Here is one way (and I can assure you there are many) to fill in those missing dates. We grab all of the days in the desired range by adding days to the start date from our Numbers table, and then perform an outer join against the sales table. Before we get to the actual results (grouped by date), let's take an initial look at how just the join looks:
SELECT n.Number, LeftRange = DATEADD(DAY, n.Number, @StartDate), RightRange = DATEADD(DAY, n.Number + 1, @StartDate), s.OrderDate, s.Total FROM dbo.Numbers AS n LEFT OUTER JOIN dbo.SampleSales AS s ON s.OrderDate >= DATEADD(DAY, n.Number, @StartDate) AND s.OrderDate < DATEADD(DAY, n.Number + 1, @StartDate) WHERE n.Number <= DATEDIFF(DAY, @StartDate, @EndDate) ORDER BY n.Number;
The output shows the value from the Numbers table, the beginning of the date range we need to join on, the next day (which serves as the end of the range we need to join on), and then the order date and total from the sales table that match the join conditions. As you can see, we get 2 rows for each of the days with sales, and a single row for the date that is missing from the sales table (and in that row, everything coming from the sales table is NULL):
Number LeftRange RightRange OrderDate Total ------ ---------- ---------- ----------------------- ----- 0 2016-02-01 2016-02-02 2016-02-01 03:45:00.000 40.00 0 2016-02-01 2016-02-02 2016-02-01 13:44:00.000 10.00 1 2016-02-02 2016-02-03 NULL NULL 2 2016-02-03 2016-02-04 2016-02-03 04:52:00.000 32.00 2 2016-02-03 2016-02-04 2016-02-03 06:55:00.000 80.00
So to get the report we really want, we simply need to adjust this query by removing some of the output columns, then grouping by the beginning of each date range and summing the sales (using COALESCE in order to show 0 instead of NULL):
SELECT OrderDate = DATEADD(DAY, n.Number, @StartDate), TotalSales = COALESCE(SUM(s.Total), 0) FROM dbo.Numbers AS n LEFT OUTER JOIN dbo.SampleSales AS s ON s.OrderDate >= DATEADD(DAY, n.Number, @StartDate) AND s.OrderDate < DATEADD(DAY, n.Number + 1, @StartDate) WHERE n.Number <= DATEDIFF(DAY, @StartDate, @EndDate) GROUP BY n.Number ORDER BY n.Number;
Now the results are the ones we wanted, with a row for February 2nd:
OrderDate TotalSales ---------- ---------- 2016-02-01 50.00 2016-02-02 0.00 2016-02-03 112.00
This does not yield a better-looking plan in this contrived case, but I think it will beat many of the alternatives you might use to fill in that missing row (for example, a loop that steps through all of the dates in the sales table, stores the ones it doesn't find in a temporary table, and then unions that with the grouping query above).
As with other problems, for this task your Numbers table would need to be large enough to cover every distinct row possible in the output (including the missing rows).
By far the most common use I've had for a Numbers table is to split strings. While an atrocity like a comma-separated string flies in the face of relational databases, and many would argue it has no place inside of SQL Server, the reality is that this is something we're just going to have to get used to dealing with. So, I'm going to construct a very simple example; we have the following string:
DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey';
Regardless of where this list came from or what the output will be used for, assume that we have been charged with producing a set from this using T-SQL. The output should look like this:
Name ----- Mark Sindy Rob Casey
Over the years, I have found that most people approach this problem with a very programmatic and iterative mindset. They proceed with a loop that goes through the entire string, one character at a time, and builds the next value until it comes across the next delimiter. I have seen this solution implemented time and time again, and is almost always a source of performance issues:
DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ','; DECLARE @chunks TABLE(chunk VARCHAR(4000)); DECLARE @idx SMALLINT = 1, @char CHAR(1), @chunk VARCHAR(4000) = ''; WHILE @idx <= LEN(@List) BEGIN SET @char = SUBSTRING(@List, @idx, 1); IF @char = @Delimiter BEGIN IF @chunk > '' INSERT @chunks(chunk) SELECT @chunk; SET @chunk = ''; END ELSE BEGIN SET @chunk += @char; END SET @idx += 1; END IF @chunk > '' INSERT @chunks(chunk) SELECT @chunk; SELECT Name = chunk FROM @chunks;
Another approach I have seen is a little better, by processing whole words at a time instead of building the individual elements character by character, but still sub-optimal:
DECLARE @List VARCHAR(4000) = 'Mark,Sindy,Rob,Casey', @Delimiter CHAR(1) = ','; DECLARE @chunks TABLE(chunk VARCHAR(4000)); DECLARE @idx SMALLINT, @pos SMALLINT; WHILE LEN(@List) > 0 BEGIN SET @idx = CHARINDEX(@Delimiter, @List); IF @idx = 0 BEGIN INSERT @chunks(chunk) SELECT @List; BREAK; END ELSE BEGIN INSERT @chunks(chunk) SELECT SUBSTRING(@List, 1,@idx - 1); SET @pos = @idx + 1; SET @List = SUBSTRING(@List, @pos , LEN(@List) - @pos + 1); END END SELECT Name = chunk FROM @chunks;
I'm not going to explain these routines in any depth because they are fairly straightforward to follow and, performance aside, they get the desired results. I'm sure there are dozens of other approaches that follow the same type of loop-based thinking, with slight variations in implementation; also, beware of any clever-looking XML-based splitting mechanisms, as they tend to carry a lot of additional overhead, even though there might not be any loop-based logic visible (evidence here).
The approach I like that uses a Numbers table is rather concise, but I am going to expand it here so that I can explain how it works. Starting with the same list of strings, we're going to join the list of strings to the Number table, but only at each location where a comma is located. This means that the join will produce one row for each comma. Then all we have to do is:
DECLARE @List VARCHAR(4000) = 'MarkBob', @Delimiter CHAR(1) = ','; SELECT List = @List, BeginningOfElement = Number, NextDelimiter = CHARINDEX(@Delimiter, @List + @Delimiter, Number) FROM dbo.Numbers WHERE Number <= LEN(@List) AND CHARINDEX(@Delimiter, @Delimiter + @List, Number) = Number;
Inspecting the output, we can see that the join allows us to identify the location after each comma (and we prepend one using @Delimiter + List), and the location of each subsequent comma (and we append one using @List + @Delimiter):
List BeginningOfElement NextDelimiter -------------------- ------------------ ------------- Mark,Sindy,Rob,Casey 1 5 Mark,Sindy,Rob,Casey 6 11 Mark,Sindy,Rob,Casey 12 15 Mark,Sindy,Rob,Casey 16 21
I exposed List in the output as well, so that it would be visually clear that, on each row, you could take the substring of List, from BeginningOfElement to NextDelimiter - 1, and this would produce the string in that location. To visualize further:
List BeginningOfElement NextDelimiter -------------------- ------------------ ------------- Mark,Sindy,Rob,Casey 1 5 -- char 1 ---^ ^--- char 5 Mark,Sindy,Rob,Casey 6 11 -- char 6 ---^ ^--- char 11 Mark,Sindy,Rob,Casey 12 15 -- char 12 ---^ ^--- char 15 Mark,Sindy,Rob,Casey 16 21 -- char 16 ---^ ^--- char 21
So, we can take the query above, stuff it into a subquery or CTE, and then extract the relevant parts of each string.
;WITH cteW AS ( SELECT List = @List, BeginningOfElement = Number, NextDelimiter = CHARINDEX(@Delimiter, @List + @Delimiter, Number) FROM dbo.Numbers WHERE Number <= LEN(@List) AND CHARINDEX(@Delimiter, @Delimiter + @List, Number) = Number ) SELECT Name = SUBSTRING(@List, BeginningOfElement, NextDelimiter - BeginningOfElement) FROM cteW;
This could be made even more concise, removing the CTE and friendly output column names, at a measured cost of readability:
SELECT Name = SUBSTRING(@List, Number, CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number) FROM dbo.Numbers WHERE Number <= LEN(@List) AND CHARINDEX(@Delimiter, @Delimiter + @List, Number) = Number;
For splitting strings, your numbers table would have to have enough numbers to cover the length of the longest supported string. This will commonly be limited to 4,000 (for Unicode) or 8,000 (for ASCII), but if you need to support MAX types, you'll need to decide if you really want to allow strings as large as 1 or 2 billion characters - there is probably a practical limit somewhere that you can arrive at.
This is the foundation for a Numbers table-based string splitting mechanism, but ignores a lot of items for simplicity - for example, dealing with delimiters that have more than one character, using Unicode strings or MAX types, returning the results in a specific order, eliminating duplicates, and handling edge cases like empty elements, single-element variables, leading and trailing spaces, and so on. I may deal with these issues in a future tip, but to get more details on some of them in the meantime, see Removing Duplicates from Strings in SQL Server. (Though honestly, if you are splitting strings often, there are even better alternatives, such as much more powerful string handling in CLR, or using table-valued parameters to avoid splitting in the first place.)
In this pair of tips, I set out to explain how a Numbers table works, and to give a bit more detail about a few of the more common problems it can solve. A lot of people object to the storage and memory requirements of another table, but I hope I have shown that it can be a useful and worthwhile weapon in your arsenal.
- See these related tips and other resources:
- The SQL Server Numbers Table, Explained - Part 1
- Removing Duplicates from Strings in SQL Server
- Validate the contents of large dynamic SQL strings in SQL Server
- Comparing string splitting / concatenation methods
- Generate a set or sequence without loops - part 1 - part 2 - part 3
- Why are numbers tables "invaluable"?
- You REQUIRE a Numbers table!
- All SQL Server Data Type Tips
Last Updated: 2016-02-26
About the author
View all my tips