Return TOP (N) Rows using APPLY or ROW_NUMBER() in SQL Server
Have you ever heard the old saying that there's more than one way to peel a banana? There's also a variation about cats, but I love cats. The same thing applies to SQL. Most times, there are multiple ways to do the same thing. For instance, if you need to return a specific number of rows from a table-valued function or expression. A typical method developers use for accomplishing this task is the APPLY operator. However, is using APPLY the best method for getting the results? What about using the ROW_NUMBER() window function to return the results? How do you know which one to choose? Also, what makes one better than the other?
In this article, we'll compare two methods for achieving the same results. First, I'll explore the APPLY operator in detail, including when you typically see it used. Did you know it comes in two variants? Then we'll compare APPLY to ROW_NUMBER() for returning the TOP (n) rows from a table expression. I'll cover the criteria used for measuring performance. Which one do you think will win in a head-to-head comparison? By the end of this article, you'll know which one to choose for your next SQL project.
Microsoft introduced the APPLY operator in SQL 2005. In an article, Arshad Ali describes APPLY as a join clause: "it allows joining between two table expressions, i.e., joining a left/outer table expression with a right/inner table expression." Since both tables can technically be an expression, for the rest of the article and to reduce confusion, I will refer to the first/left table as a table and the second as a table expression.
Another handy use of APPLY is with the OPENJSON function. You can transform JSON data into a relational format. Andrew Villazon created an extensive blog article covering the topic.
The APPLY operator comes in two variants. The first is the CROSS APPLY, which should not be confused with a join that produces a Cartesian product. The second is called the OUTER APPLY.
It is helpful to think of a CROSS APPLY as an INNER JOIN—it returns only the rows from the first table that exist in the second table expression. You'll sometimes refer to this as the filtering or limiting type since you filter rows from the first table based on what's returned in the second.
SELECT ft.ColumnName, st.Amount FROM dbo.FirstTable ft CROSS APPLY ( SELECT st.Amount FROM dbo.SecondTable st WHERE st.FirstTableId = ft.Id ) st;
On the other hand, OUTER APPLY shares similarities with an OUTER JOIN. It returns all the rows from the first table and matches from the second. You hear this one referred to as NULL producing. If a row doesn't exist in the table expression, it's represented with a NULL.
SELECT ft.ColumnName, st.Amount FROM dbo.FirstTable ft OUTER APPLY ( SELECT st.Amount FROM dbo.SecondTable st WHERE st.FirstTableId = ft.Id ) st;
Return TOP(n) Rows
A typical request you see APPLY used for is returning the TOP (n) rows from the second result set. Now that could be either CROSS or OUTER. It depends on your needs. The syntax for accomplishing this might look like the code below.
SELECT ft.ColumnName, st.Amount FROM dbo.FirstTable ft CROSS APPLY ( SELECT TOP(2) st.Amount FROM dbo.SecondTable st WHERE st.FirstTableId = ft.Id ORDER BY st.Amount DESC ) st;
The ORDER BY clause dictates in what order the rows are ranked. In the example above, if you wanted to include the two highest, you would use the keyword DESC/DESCENDING.
The second method to return the TOP(n) rows is with ROW_NUMBER(). If you've read any of my other articles on window functions, you know I love it. The syntax below is an example of how this would work.
;WITH cte_HighestSales AS ( SELECT ROW_NUMBER() OVER (PARTITION BY FirstTableId ORDER BY Amount DESC) AS RowNumber, Amount, FirstTableId FROM dbo.SecondTable ) SELECT ft.ColumnName, st.Amount FROM dbo.FirstTable ft INNER JOIN cte_HighestSales st ON st.FirstTableId = ft.Id AND st.RowNumber < 3;
Which one should you use in practice? To answer that question, we'll need to build a dataset and compare the performance of the two methods.
Build a Test Dataset
The code below creates two tables. The first table is smaller and acts more as a dimension. The second one has the most rows.
USE [master]; GO IF DATABASEPROPERTYEX( 'PizzaTracker', 'Version' ) IS NOT NULL BEGIN ALTER DATABASE PizzaTracker SET SINGLE_USER WITH ROLLBACK IMMEDIATE; DROP DATABASE PizzaTracker; END; GO CREATE DATABASE PizzaTracker; GO ALTER DATABASE PizzaTracker SET RECOVERY SIMPLE; GO USE PizzaTracker; GO CREATE TABLE dbo.MutantFighters ( Id INT PRIMARY KEY, [Name] VARCHAR(50) NOT NULL, Species VARCHAR(50) NOT NULL ); INSERT INTO dbo.MutantFighters ( Id, Name, Species ) VALUES (1, 'Leonardo', 'Turtle'), (2, 'Michelangelo', 'Turtle'), (3, 'Donatello', 'Turtle'), (4, 'Raphael', 'Turtle'), (5, 'Splinter', 'Rat'), (6, 'Bebop', 'Warthog'), (7, 'Rocksteady', 'Rhino'); CREATE TABLE dbo.PizzaLog ( Id INT IDENTITY(1, 1) NOT NULL, MutantId INT NOT NULL, Slices INT NOT NULL, DateEaten DATE NOT NULL ); -- Generate dates for February. DECLARE @StartDate DATE = '2023-02-01'; DECLARE @EndDate DATE = '2023-02-28'; WITH Dates AS ( SELECT @StartDate AS Date UNION ALL SELECT DATEADD( DAY, 1, Date ) AS Date FROM Dates WHERE Date < @EndDate ) INSERT INTO dbo.PizzaLog ( MutantId, Slices, DateEaten ) SELECT mf.Id, CASE WHEN mf.Id = 2 THEN ABS(CHECKSUM(NEWID()) % 15) + 3 -- Mikey eats the most pizza ELSE ABS(CHECKSUM(NEWID()) % 10) + 1 END AS Slices, d.Date FROM dbo.MutantFighters mf CROSS JOIN Dates d WHERE mf.id <> 5; GO
In the query below, I want to retrieve the three days each character consumed the most pizza slices. I'll start with a CROSS APPLY and an INNER JOIN on the ROW_NUMBER(). I will also turn on my actual execution plan and STATISTICS IO. We'll compare the number of logical reads and the plan cost. The plan cost isn't an exact science, but it gives a decent approximation.
SET STATISTICS IO ON; SELECT mf.[Name], pl.DateEaten, pl.Slices FROM dbo.MutantFighters mf CROSS APPLY ( SELECT TOP (3) DateEaten, Slices FROM dbo.PizzaLog pl WHERE pl.MutantId = mf.Id ORDER BY Slices DESC, DateEaten ASC ) pl; SET STATISTICS IO OFF; SET STATISTICS IO ON; ;WITH cte_pl AS ( SELECT ROW_NUMBER() OVER(PARTITION BY MutantId ORDER BY Slices DESC, DateEaten ASC) AS RowNumber, MutantId, Slices, DateEaten FROM dbo.PizzaLog ) SELECT mf.[Name], cte_pl.DateEaten, cte_pl.Slices FROM dbo.MutantFighters mf INNER JOIN cte_pl ON cte_pl.MutantId = mf.Id WHERE cte_pl.RowNumber < 4 SET STATISTICS IO OFF;
The results above show that CROSS APPLY performed a total of 7 logical reads while ROW_NUMBER() comes in at 3. When executed as a batch, CROSS APPLY accounts for 78% of the cost.
Let's use an OUTER APPLY and a LEFT JOIN for our second test. Here I'm also interested in the character who didn't eat any pizza.
SET STATISTICS IO ON; SELECT mf.[Name], pl.DateEaten, pl.Slices FROM dbo.MutantFighters mf OUTER APPLY ( SELECT TOP (3) DateEaten, Slices FROM dbo.PizzaLog pl WHERE pl.MutantId = mf.Id ORDER BY Slices DESC, DateEaten ASC ) pl; SET STATISTICS IO OFF; SET STATISTICS IO ON; ;WITH cte_pl AS ( SELECT ROW_NUMBER() OVER(PARTITION BY MutantId ORDER BY Slices DESC, DateEaten ASC) AS RowNumber, MutantId, Slices, DateEaten FROM dbo.PizzaLog ) SELECT mf.[Name], cte_pl.DateEaten, cte_pl.Slices FROM dbo.MutantFighters mf LEFT JOIN cte_pl ON cte_pl.MutantId = mf.Id AND cte_pl.RowNumber < 4 SET STATISTICS IO OFF;
As you can see from the screenshots above, the results are nearly identical. Notice that I am pulling in an extra row to account for Splinter. When comparing logical reads and batch cost, the win easily goes to ROW_NUMBER(). With a smaller result set, I wouldn't stress over which method to use. However, if you're using larger tables, it might be worth reconsidering your approach. Please let me know in the comments section if you prefer one over the other.
- You can use the window function ROW_NUMBER() and the APPLY operator to return a specific number of rows from a table expression.
- APPLY comes in two variants CROSS and OUTER. Think of the CROSS like an INNER JOIN and the OUTER like a LEFT JOIN.
- It will largely depend on your preference, but I often choose ROW_NUMBER() due to my fondness for it.
- With a larger dataset, i.e., millions of rows, I'd speculate ROW_NUMBER() might provide a noticeable performance benefit.
- Would you like to learn more about the ROW_NUMBER() window function? Please check out this article I wrote that goes into detail: SQL Server ROW_NUMBER for Ranking Rows
- When generating the data set, I used a recursive CTE to create all the days of February. Edwin Sarmiento wrote an informative article titled, Recursive Queries using Common Table Expressions (CTE) in SQL Server. I highly recommend that you check it out.
- A post wouldn't be complete without referencing something by Aaron Bertrand. Aaron wrote a short article, Advanced JSON Techniques in SQL Server - Part 3, on reading JSON using CROSS APPLY and the OPENJSON function.
About the author
View all my tips
Article Last Updated: 2023-04-11