SQL Server Between Dates Issue



By: Aaron Bertrand


In this video tutorial we will look at why using BETWEEN for date range queries is a bad idea and how it can lead to inconsistencies in your queries.


T-SQL Script

The script below will let you run the tests discussed in the video. You can also download the script.

USE tempdb;
SET NOCOUNT ON;
GO

-- create a simple orders table
CREATE TABLE dbo.Orders
(
  OrderID   int IDENTITY(1,1) PRIMARY KEY,
  OrderDate datetime2(7)
);
GO

-- preview all the days in February at midnight
WITH x AS (SELECT n = 0 UNION ALL SELECT n + 1 FROM x WHERE n < 28)
SELECT OrderDate = DATEADD(DAY, n, '20200201') FROM x;

-- insert an order for each day in February
WITH x AS (SELECT n = 0 UNION ALL SELECT n + 1 FROM x WHERE n < 28)
INSERT dbo.Orders(OrderDate) 
SELECT DATEADD(DAY, n, '20200201') FROM x;

-- validate there are 29 rows, all in February:
SELECT DATENAME(MONTH, OrderDate), YEAR(OrderDate), COUNT(*)
FROM dbo.Orders
GROUP BY DATENAME(MONTH, OrderDate), YEAR(OrderDate); 

-- then add a few edge cases:
INSERT dbo.Orders(OrderDate)
VALUES
  ('20200229 23:59:31'),
  ('20200229 23:59:59.9985000'),
  ('20200301');

-- check again, 31 orders in February, one in March:
SELECT DATENAME(MONTH, OrderDate), YEAR(OrderDate), COUNT(*)
FROM dbo.Orders
GROUP BY DATENAME(MONTH, OrderDate), YEAR(OrderDate); 
GO

-- now use various tricks to find the "end" of the month

-- EOMONTH
DECLARE @m datetime = '20200201';

SELECT 'EOMONTH', OrderCount = COUNT(*) 
FROM dbo.Orders
WHERE OrderDate BETWEEN @m AND EOMONTH(@m);
GO

-- subtract a minute with smalldatetime
DECLARE @s smalldatetime = '20200201';
DECLARE @e smalldatetime = DATEADD(MINUTE, -1, DATEADD(MONTH, 1, @s));

SELECT '1m smalldatetime', OrderCount = COUNT(*)
FROM dbo.Orders
WHERE OrderDate BETWEEN @s AND @e;
GO

-- common 3 millisecond trick
DECLARE @s datetime = '20200201';
DECLARE @e datetime = DATEADD(MILLISECOND, -3, DATEADD(MONTH, 1, @s));

SELECT '3ms datetime', OrderCount = COUNT(*) 
FROM dbo.Orders
WHERE OrderDate BETWEEN @s AND @e;
GO

-- common 3 millisecond trick with smalldatetime
DECLARE @s smalldatetime = '20200201';
DECLARE @e smalldatetime = DATEADD(MILLISECOND, -3, DATEADD(MONTH,1,@s));

SELECT '3ms smalldatetime', OrderCount = COUNT(*)
FROM dbo.Orders
WHERE OrderDate BETWEEN @s AND @e;
GO

-- common 3 millisecond trick with datetime2
DECLARE @s datetime2(7) = '20200201';
DECLARE @e datetime2(7) = DATEADD(MILLISECOND, -3, DATEADD(MONTH,1,@s));

SELECT '3ms datetime2', OrderCount = COUNT(*)
FROM dbo.Orders
WHERE OrderDate BETWEEN @s AND @e;
GO

-- Always use a half-open interval.
-- Easier to find beginning of next period than end of current period:
DECLARE @start date = '20200201';

SELECT 'open interval', OrderCount = COUNT(*)
  FROM dbo.Orders
 WHERE OrderDate >= @start
   AND OrderDate <  DATEADD(MONTH, 1, @start);
GO

-- clean up:
DROP TABLE dbo.Orders;
GO

Video Transcript

Hi, I'm Aaron Bertrand with a series of short videos for MSSQLTips.com. Today I wanted to talk about the Between operator, which is something that people use for date range queries, and why that's usually a bad idea.

Even in English, between is ambiguous. Usually, when someone says we're open between nine and five, you assume that includes 9:00 and includes 5:00. It doesn't start at 9:01 and end at 4:59; between is inclusive. However, we have other cases like soccer, where you kick the ball between the posts. You don't want to hit either post; you want to get it between them, which means don't include the beginning and the end of the range, if you will.

In T-SQL in SQL Server, between is a closed interval, which means it includes both ends of the range. This is great for whole numbers, where both goalposts are meant to be included. So if a police officer is capturing the speeds of people driving down the road, you can say where speed between 70 and 79; speeds are measured in whole miles per hour or whole kilometers per hour, so there's no chance that when you run the next query, between 80 and 89, you'll miss someone going 79.6. For whole numbers, between works just fine.

But when fractions are involved, it's not so great. A stock price can have nine decimal places, or twelve decimal places, or some ridiculous granularity in very small amounts. You can say where stock price between 60 and 70, but unless your next query starts at 70 followed by however many decimal places you need and a one at the end, you're either going to count values of exactly 70 in both queries, or you're going to have to write between 60 and 69.999999999. And nobody wants to write that kind of stuff, right? So for date ranges, a closed interval is typically bad, and the reason is that calculating the end of a range is hard.
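
To see that boundary problem concretely, here is a minimal sketch; the table variable and the price values are made up for illustration:

```sql
-- hypothetical prices, including a value exactly on the 70 boundary
DECLARE @Prices TABLE (Price decimal(18,9));
INSERT @Prices(Price) VALUES (60.00), (69.999999999), (70.00), (79.60), (80.00);

-- closed intervals: 70.00 is counted in BOTH buckets
SELECT Bucket = '60-70', Cnt = COUNT(*) FROM @Prices WHERE Price BETWEEN 60 AND 70
UNION ALL
SELECT '70-80', COUNT(*) FROM @Prices WHERE Price BETWEEN 70 AND 80;

-- half-open intervals: every value lands in exactly one bucket
SELECT Bucket = '60-70', Cnt = COUNT(*) FROM @Prices WHERE Price >= 60 AND Price < 70
UNION ALL
SELECT '70-80', COUNT(*) FROM @Prices WHERE Price >= 70 AND Price < 80;
```

The closed intervals count 70.00 twice; the half-open version assigns each price to exactly one bucket.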

The reason it's a problem is, for example: when is the end of February? When is the exact last moment in February? It depends. February is a little extra complicated because of leap years, but for any month, what is the last moment that could be recorded in a table? It depends on the data type being used now. It depends on the data type that might be used in the future, if you ever do any conversions. And it depends on your parameters and how your data types are converted before they're compared to the data in the table.

So think about February as a timeline: we have a report that we want to run, and we want to include all the data in February. The beginning is always easy, right? It's very easy to calculate the beginning of the month; when you say between February 1st and something, that always includes February 1st at midnight. So that's not a problem. The problem is when we get to the end of the month. The last day in February is usually the 28th; this year it happens to be the 29th, so let's assume this year. But when is the last moment of February 29th? That depends on the data type.

If I take the date data type and I convert some time on the 29th to date, it's going to truncate; it rounds down to February 29th at midnight. This means my report is going to be missing an entire day's worth of data. If I use smalldatetime, the same thing happens at minute granularity. If I subtract a minute from midnight on March 1st, then I'm going to miss a minute of data. And if instead I do one of these subtract-three-milliseconds tricks, smalldatetime rounds up and I'm actually going to include data from March 1st. I don't want that either.
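
A quick sketch of how each data type treats the same moment on February 29th (the input value is just an example):

```sql
-- one moment late on February 29th, converted to three different types
DECLARE @t datetime2(7) = '20200229 23:59:31';

SELECT
  AsDate          = CAST(@t AS date),          -- truncates down to 2020-02-29
  AsSmalldatetime = CAST(@t AS smalldatetime), -- rounds UP to 2020-03-01 00:00
  AsDatetime      = CAST(@t AS datetime);      -- keeps 23:59:31.000
```

date always rounds down, smalldatetime rounds to the nearest minute (here, up into March), and datetime keeps the value at its 1/300-second granularity.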

If I use datetime, there's the three-millisecond trick that a lot of people use: subtract three milliseconds from the beginning of the next month. That's supposed to give you the end of this month, but it isn't really the end of this month, because with datetime2 you can actually get all the way up to 59.9999999. Any of the other data types will have to round to a value they support, which means you can potentially lose data.
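
The 1/300-second granularity of datetime is easy to demonstrate; fractional seconds can only end in .000, .003, or .007:

```sql
-- datetime rounds fractional seconds to the nearest 1/300-second tick
SELECT
  Ok      = CAST('20200229 23:59:59.997' AS datetime), -- stored as-is: ...59.997
  Rounded = CAST('20200229 23:59:59.999' AS datetime); -- rounds up to 2020-03-01 00:00:00.000
```

So "three milliseconds before midnight" (.997) really is the last datetime value in the month, but any finer-grained value between .997 and midnight is invisible to the type.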

So let's take a look at a demo. I'm in tempdb, and I'm going to create a very simple orders table. It just has an OrderID column and an order date. Then I'm going to insert an order at midnight for each day in February. This is just a recursive CTE that grabs 29 rows for February and inserts them into the table. If we validate that with a group by, there we have February 2020, 29 rows. But having all those days at midnight isn't enough to demonstrate the problem.

So I'm also going to insert a couple of edge cases. One is 23:59:31 on the last day, one is 1.5 milliseconds before midnight, and one is just fully rounded March 1st. Let's insert those rows, and then we can validate again that the orders in the table are the ones we expect. We should see 31 orders in February (29 days at midnight plus the two edge cases) and one order in March, on March 1st at midnight. And there you go: February 2020, 31 rows; March 2020, one row. That works fine when I'm using DATENAME and getting a count, but what if I'm using a range, and what if I have other criteria in my where clause that doesn't allow me to simply group by the output of a function?

So let's use the EOMONTH function, which was introduced in SQL Server 2012. EOMONTH sounds like it should give you the end of the month, but it doesn't; it gives you the beginning of the last day of the month, which is not the end of the month in any of the other data types we have to deal with. So when I say give me all the rows between February 1st and the end of the month for February, I get 29. It's missing those two edge-case rows that happened after midnight, and if I had 600,000 other orders during the day on February 29th, it would miss those too.
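
EOMONTH can still be useful, but only as part of a half-open range: add a day to its result to get the start of the next month, then use >= and <. A sketch against the demo table:

```sql
DECLARE @m date = '20200201';

-- half-open range built from EOMONTH: >= Feb 1, < Mar 1
SELECT 'EOMONTH half-open', OrderCount = COUNT(*)
FROM dbo.Orders
WHERE OrderDate >= @m
  AND OrderDate <  DATEADD(DAY, 1, EOMONTH(@m));
```

With the demo data this should return all 31 February orders, because nothing on February 29th, at any granularity, is excluded.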

The three-millisecond trick, where we subtract three milliseconds from midnight on March 1st: a similar thing happens here. We get 30 rows, because three milliseconds before midnight is 23:59:59.997. Only 30 orders come back, because that one order at 59.9985 is missed. Now, you might say that's statistically not relevant, but it is absolutely inaccurate. Why would you open yourself up to the risk of missing an order, maybe a very important order that someone squeezed in right under the wire before midnight?

If we do the same thing with smalldatetime, what happens is that smalldatetime actually rounds up. We subtract three milliseconds from midnight on March 1st, but smalldatetime can't store millisecond granularity, so it just rounds up to the next minute. If I subtracted 31 seconds from midnight, it would round down to 23:59 and I would still lose data, but in this case it rounds up. So even though I subtracted time away from March 1st, the end of the range I'm comparing against is March 1st, and I actually get 32 orders here because it included that row from March 1st.

If I do that same three-millisecond trick with datetime2, it's actually no different from the trick with datetime. The difference is, if you have a stored procedure with a datetime parameter and you make a sweeping change in your environment, say converting all of your datetime data types to datetime2, you're going to catch the datetime parameter in the stored procedure and change it to datetime2, and now you're passing more accurate values into the procedure. But that change is not going to catch the expression that subtracts three milliseconds from the accurate datetime2 you passed in, so you're still going to end up with .997, and your reports are still going to be inaccurate.
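
A sketch of that scenario; the procedure name and parameter are hypothetical, not from the demo script:

```sql
-- hypothetical procedure: the parameter was upgraded to datetime2,
-- but the -3 ms expression still produces a .997 upper bound
CREATE OR ALTER PROCEDURE dbo.MonthlyOrderCount
  @MonthStart datetime2(7)
AS
BEGIN
  DECLARE @MonthEnd datetime2(7) =
    DATEADD(MILLISECOND, -3, DATEADD(MONTH, 1, @MonthStart));

  SELECT OrderCount = COUNT(*)
  FROM dbo.Orders
  WHERE OrderDate BETWEEN @MonthStart AND @MonthEnd; -- still misses 59.9985
END;
```

Changing the parameter type alone doesn't fix anything; the end-of-range expression has to change too, which is exactly why the half-open pattern is safer.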

So those kinds of data type changes can happen and still cause the problem, because your other expressions still treat the value as datetime, right? We still only get 30 orders, and it misses the one at .9985. Most of the time, what we want is a semi-open interval: include the beginning of the range, but don't include the end. It's always easier to find the beginning of the next period than the end of the current period.

It would be really nice if Between had extensions. I've seen discussions about this online, and I have a link in the last slide that points to one of them. If the SQL syntax allowed Between to say between the beginning of February inclusive and the beginning of March exclusive, you could easily write a query that says I want all of February: include February 1st at midnight in the range, don't include March 1st at midnight. That's what inclusive and exclusive keywords would indicate, but they're not supported today. The way we do that is a semi-open interval, where OrderDate is greater than or equal to February 1st and OrderDate is less than March 1st.
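
Side by side, the hypothetical syntax (shown only as a comment, since it is not valid T-SQL) and the pattern that actually works today:

```sql
-- NOT valid T-SQL; a hypothetical extension discussed in the transcript:
-- WHERE OrderDate BETWEEN '20200201' INCLUSIVE AND '20200301' EXCLUSIVE

-- what works today: the semi-open interval
SELECT OrderCount = COUNT(*)
FROM dbo.Orders
WHERE OrderDate >= '20200201'
  AND OrderDate <  '20200301';
```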

Going back to the demo, we can use any data type here at all. Whether we assign this to a date, a datetime2, or a smalldatetime, it doesn't matter; it always resolves to February 1st at midnight, regardless of the data type, so we can use it as the beginning of our range. Instead of using Between, I'm saying where OrderDate is greater than or equal to the start of my range, and OrderDate is less than the beginning of the next range. This is the only query in this entire sample that gives the correct answer of 31 rows.

So I hope that was a good explanation of why you shouldn't use Between for date range queries. I've got resources you can look at: all the date- and time-related tips on MSSQLTips.com, an index of my bad habits and best practices blog posts (a lot of those are date- and time-related), and a link to a discussion about how the Between design got implemented in the SQL standard. There are some other resources below. Thank you very much for watching, and I'll see you in the next video.


Last Update: 4/21/2020
