Using Window Functions to Create Smart, Quick Queries


By:   |   Updated: 2021-02-16   |   Comments   |   Related: More > Functions System


Problem

I've heard that SQL Server window functions can help improve certain types of queries. Can you show me some examples of this and how to make queries run faster?

Solution

Window functions have several use cases when writing TSQL statements. They can help rank the rows of a dataset or they can be used to create aggregations of a column of data across a limited number of rows, such as a sum of the last 30 rows instead of a sum of all the rows.

This tip will show 2 different ways that window functions can help a query run faster when compared to an older methodology that was popular before SQL Server added these functions. Each example we will show the older way to implement the query along with the window function method. It will compare the performance of each to quantify the improvement

This tip assumes a basic knowledge of what STATISTICS IO and STATISTICS TIME are and what query plans are. If unfamiliar with these features, this tip will offer a primer on STATISTICS TIME and IO and this tip will cover query plans. Both will prepare the reader for the information in this tip.

All of the demos in this tip will use the WideWorldImporters sample database which can be downloaded from here and will be run against SQL Server 2019. The images may be different, but the methodology should still work on older versions of SQL Server.

Overview of SQL Window Functions for Ranking Rows

One of the main uses of window ranking functions such as ROW_NUMBER, RANK, and DENSE_RANK is to rank a set of rows based on sorting criteria.

These functions can either rank an entire dataset or it can rank separate partitions. Think of a partition as a GROUP BY for the window function. Where a GROUP BY allows for separate aggregates to be performed on each unique column value set, a PARTITION BY in a window function starts the ranking process with the OVER clause at each change PARTITION BY column change. This will make more sense in the examples below.

Consider this query using window function ROW_NUMBER().

SELECT C.CustomerName, i.InvoiceDate, i.CustomerPurchaseOrderNumber, il.ExtendedPrice, 
   ROW_NUMBER() OVER (ORDER BY il.ExtendedPrice DESC) RowRank
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
  INNER JOIN Sales.Customers c ON i.BillToCustomerID = c.CustomerID
WHERE
  InvoiceDate BETWEEN '1/1/2014' AND '3/31/2014';

This query window function starts at 1 and counts up as the ExtendedPrice goes down, as shown below in the result set.

The RowRank column orders the output based on the Extended price of each item.

Next, a PARTITION BY will be added to the window function on the BillToCustomerID column.

SELECT C.CustomerName, i.InvoiceDate, i.CustomerPurchaseOrderNumber, il.ExtendedPrice, 
   ROW_NUMBER() OVER (PARTITION BY i.BillToCustomerID ORDER BY il.ExtendedPrice DESC) RowRank
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
  INNER JOIN Sales.Customers c ON i.BillToCustomerID = c.CustomerID
WHERE
  InvoiceDate BETWEEN '1/1/2014' AND '3/31/2014';

Notice how the RowRank column starts at 1 and counts up like before, but now at every change in customer based on BillToCustomerID it resets to 1 and starts counting up again. This feature is the part that often makes this function so powerful.

This screenshot shows how every time the customer changes, the RowRank column starts back over at one and counts up again.

Ranking Rows Tutorial in SQL Server

Ranking rows using a window function and a PARTITION BY is a great way reduce the size and complexity of some queries that require subqueries.

Query Syntax without a Window Function

Consider the query below from this previous tip. A TSQL developer has been tasked with finding the invoice date and purchase order number for each customer's largest individual purchase during the first quarter of 2014. The solution from that tip did not include a window function – but maybe it should have!

The former solution has 2 parts.

  • In the first part, the TopPurchase CTE, calculates the amount of the highest purchase for each customer.
  • In the second part, a similar query is run to include the details of that query.
-- first query without window function
WITH TopPurchase AS(  SELECT BillToCustomerID, MAX(ExtendedPrice) Amt
  FROM Sales.Invoices i
    INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
  WHERE
    InvoiceDate BETWEEN '1/1/2014' AND '3/31/2014'
  GROUP BY
    BillToCustomerID
  )
SELECT C.CustomerName, i.InvoiceDate, i.CustomerPurchaseOrderNumber 
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
  INNER JOIN Sales.Customers c ON i.CustomerID = c.CustomerID
  INNER JOIN TopPurchase t ON t.BillToCustomerID = i.BillToCustomerID AND t.Amt = il.ExtendedPrice
WHERE
     InvoiceDate BETWEEN '1/1/2014' AND '3/31/2014';

Query Syntax with a Window Function

How can this be accomplished with a window function? Start by implementing the window function that partitions by BillToCustomerID and sorts the amounts descending order.

SELECT C.CustomerName, i.InvoiceDate, i.CustomerPurchaseOrderNumber, il.ExtendedPrice,       RANK() OVER (PARTITION BY i.BillToCustomerID ORDER BY il.ExtendedPrice DESC) RowRank 
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
  INNER JOIN Sales.Customers c ON i.BillToCustomerID = c.CustomerID
WHERE
  InvoiceDate BETWEEN '1/1/2014' AND '3/31/2014';

This returns the RowRank for all of the customer records, so to get the largest transaction for each customer we just need to get the records where RowRank has a value of 1.

query results

We could try the following query to just get RowRank = 1 as shown in the last line of the query in the below image. Sadly, it's not quite that simple. The window function can't be used in a WHERE clause as shown in the error message.

error message from window function

The query below is the final version. The first query is placed into a CTE so that the RowRank column can be placed in a WHERE clause in the main query and this will return the results we want.

-- second query with window function
WITH TopPurchase AS (
  SELECT C.CustomerName, i.InvoiceDate, i.CustomerPurchaseOrderNumber, 
     RANK() OVER (PARTITION BY i.BillToCustomerID ORDER BY il.ExtendedPrice DESC) RowRank
  FROM Sales.Invoices i
    INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
    INNER JOIN Sales.Customers c ON i.BillToCustomerID = c.CustomerID
  WHERE
    InvoiceDate BETWEEN '1/1/2014' AND '3/31/2014'
  )
SELECT CustomerName, InvoiceDate, CustomerPurchaseOrderNumber
FROM TopPurchase
WHERE RowRank = 1;

Performance Comparison Window Function versus No Window Function

The query with the window function is shorter by number of lines and performs better than the query without the window function that uses a larger subquery. If we look at the query without a window function and the query with a window function and include STATISTICS IO, TIME and the query execution plan this makes that very clear.

STATISTICS IO shows the same number of rows returned by each query and reduced reads and scans for both Invoices and InvoiceLines.

This screenshot of the STATISTICS IO output shows 5 scans and 12,397 for the first version of the query and 2 scans and 9,409 for the new version based on the window function.

The output of STATISTICS TIME shows a similar story with the first query running over 4x as long as the second, that's significant optimization.

This screenshot of the output of STATISTICS TIME shows the first query taking 467ms and the window function version taking 103ms.

Finally, the query execution plan estimates that the first query costs slightly more than the second.

This screenshot of the query execution plans shows that the first query costs 58% of the batch while the window function version costs only 42% of that same batch.

Consider using window functions in situations like this when looking for the biggest, earliest, or first row per customer, per day, etc. It tends to be much faster than alternative methods.

Rolling Aggregates in SQL Server

A rolling aggregate is a common KPI. These are commonly seen as "To-Date" or "Moving Average" values on reports where the current row value is the average of this row and the prior 30 or the sum of this row and all prior rows. Common examples would be a 30-day or 200-day moving average or Month-To-Date columns.

Query without a Window Function

Before the invention of windowed aggregates these were often accomplished via inline subqueries. The below query is ugly, but it accomplishes the task of calculating a 30 day moving average and a running total for year-to-date sales for 2013.

SELECT i.InvoiceDate, SUM(il.ExtendedPrice) TotalDollars
, (SELECT SUM(il2.ExtendedPrice)/ COUNT(DISTINCT i2.InvoiceDate) 
   FROM Sales.Invoices i2 INNER JOIN Sales.InvoiceLines il2 ON i2.InvoiceID = il2.InvoiceID 
   WHERE i2.InvoiceDate BETWEEN DATEADD(dd, -30, i.InvoiceDate) AND i.InvoiceDate) [30BusinessDayMovingAverage]
, (SELECT SUM(il3.ExtendedPrice) 
   FROM Sales.Invoices i3 INNER JOIN Sales.InvoiceLines il3 ON i3.InvoiceID = il3.InvoiceID 
   WHERE i3.InvoiceDate BETWEEN '1/1/2013' AND i.InvoiceDate) [2013YearToDate]
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
WHERE i.InvoiceDate BETWEEN '1/1/2013' AND '12/31/2013'
GROUP BY i.InvoiceDate
ORDER BY i.InvoiceDate;

Query with a Window Function

Accomplishing these same calculations with a windowed aggregate function is much easier.

These queries will start with the CTE below that creates one output row per business day with the revenue for that day.

WITH DollarsPerDay AS (
SELECT i.InvoiceDate, SUM(ExtendedPrice) TotalDollars
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
GROUP BY i.InvoiceDate)

We can combine the main query that reads the CTE and use 2 windowed aggregate functions to calculate the moving average and year-to-date amounts. 

  • The first averages the TotalDollars column for this row and the 29 preceding following rows when sorted by InvoiceDate to create a 30-day moving average. It is not an average of the entire data set – just those 30 rows per row!
  • The second value creates a sum of this row and all previous rows. Only the last row in the data set will perform the aggregate function on every row in the data set.
WITH DollarsPerDay AS (
SELECT i.InvoiceDate, SUM(ExtendedPrice) TotalDollars
FROM Sales.Invoices i
  INNER JOIN Sales.InvoiceLines il ON i.InvoiceID = il.InvoiceID
GROUP BY i.InvoiceDate)
SELECT InvoiceDate
  , TotalDollars
  , AVG(TotalDollars) OVER (ORDER BY InvoiceDate ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) [30BusinessDayMovingAverage]
  , SUM(TotalDollars) OVER (ORDER BY InvoiceDate ROWS UNBOUNDED PRECEDING) [2013YearToDate]
FROM DollarsPerDay
WHERE InvoiceDate BETWEEN '1/1/2013' AND '12/31/2013'
ORORDER BY InvoiceDate;

Performance Comparison Window Function versus No Window Function

Running both versions in immediate succession reveals some stark differences in performance. First, STATISTICS TIME shows an improvement of over 99% with the window function version.

This screenshot of STATISTICS IO output shows the subquery version taking over 36 seconds while the windowed aggregate took just 149ms to complete.

The output from STATISTICS IO agrees showing another 99%+ improvement using the window function version.

This screenshot of STATISTICS IO output for the 2 queries shows over 1.7 million reads on the first query and just 2,851 on the second version using windowed aggregates.

Finally, the query plan estimate agrees that the difference in methods is incredibly large and the window function version is much more efficient.

This screenshot of the query plan output in SSMS shows the first query using 100% of the effort in the batch and the windowed aggregate version using 0%.

This author has seen enough of the slow non-window function type of query in plenty of TSQL code that it is probably worth reviewing any reports to see if they use the subquery-type of logic. This is especially true if the reports were written before the windowed aggregate code was available. The programmer that can take that 36 second report down to sub-second will surely look like a hero to the end users!

Next Steps





get scripts

next tip button



About the author
MSSQLTips author Eric Blinn Eric Blinn is the Sr. Data Architect for Squire Patton Boggs. He is also a SQL author and PASS Local Group leader.

View all my tips


Article Last Updated: 2021-02-16

Comments For This Article





download














get free sql tips
agree to terms