Introduction to Subqueries in SQL Server

By: Tim Wiseman | Comments (1) | Related: > TSQL

Problem

How and when should I use subqueries while working with Microsoft SQL Server? Under what circumstances should I use subqueries in SQL Server? In this tip we look at the best way to use subqueries.

Solution

What are subqueries?

Subqueries are queries that are nested inside of another query or statement. They are permitted wherever SQL Server would allow an expression, and are indicated by enclosing the subquery in parenthesis. For instance, using the sample AdventureWorks database, we might create a query to find any employees who have more vacation available than the average:

SELECT
  BusinessEntityID,
  LoginID,
  JobTitle,
  VacationHours
FROM 
  HumanResources.Employee E1
WHERE 
  VacationHours > (SELECT
    AVG(VacationHours)
    FROM HumanResources.Employee E2)

In that example, we used the subquery in the WHERE clause and it returned a single value.

The part below is the subquery from the example above. It is the part of the query inside the parentheses.

    (SELECT
    AVG(VacationHours)
    FROM HumanResources.Employee E2)

We can put subqueries in the FROM clause and access the rows returned from them like any other table. For instance, if we want to know which employees have more vacation hours than average for their job title, we might write:

SELECT
  E1.BusinessEntityID,
  E1.LoginID,
  E1.JobTitle,
  E1.VacationHours,
  Sub.AverageVacation --Drawn from the subquery
FROM HumanResources.Employee E1
JOIN (SELECT
      JobTitle,
      AVG(VacationHours) AverageVacation
      FROM HumanResources.Employee E2
      GROUP BY JobTitle) sub
ON E1.JobTitle = Sub.JobTitle
WHERE E1.VacationHours > Sub.AverageVacation
ORDER BY E1.JobTitle

Correlated Subqueries

Correlated subqueries are subqueries which reference the main query or statement. This allows them to do a wide array of comparisons or manipulation. For instance, if we want to know which employees had their pay rates changed effective in 2003, we could use a correlated subquery like:

SELECT
  *
FROM HumanResources.Employee E
WHERE EXISTS (SELECT
                *
              FROM HumanResources.EmployeePayHistory EPH
              WHERE E.BusinessEntityID = EPH.BusinessEntityID --Comparing a value from the outer query
              AND YEAR(EPH.RateChangeDate) = 2002)

That correlated subquery compares values from the base table of the outer query with values from the table in the inner query. It would be relatively easy to rewrite that particular query by joining the tables rather than using a subquery. That is fairly common and many uses for subqueries could be reformulated to avoid using the subquery and instead use techniques like joins.

A slightly more sophisticated use for correlated subqueries allows for concatenations across rows, at least in SQL Server 2005 and later. For instance, if we wanted a concatenated list of all the titles that were associated with each organizational level from the HumanResources part of Adventure Works, we could use:

SELECT
  E.OrganizationLevel,
  STUFF((SELECT DISTINCT
    ', ' + E2.JobTitle
  FROM HumanResources.Employee E2
  WHERE E.OrganizationLevel = E2.OrganizationLevel --Correlation between inner and outer queries
  FOR xml PATH ('')), 1, 1, '')
FROM HumanResources.Employee E
GROUP BY E.OrganizationLevel

That type of query would be difficult to reframe as a JOIN, though it can be rewritten using user defined functions. Jeff Moden has looked at the performance implications of the two techniques and found that this style, using the correlated subquery, is substantially faster than using the user defined functions.

While correlated subqueries like that can perform quite well, they can often cause performance issues since they may force the server to run the subquery for every row considered by the outer statement. Jeff Moden has an article detailing why this occurs and refers to one possible scenario as creating a triangular join. For performance reasons, correlated subqueries need to be used cautiously.

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) were introduced in SQL Server 2005 and they are essentially a different syntax for creating a subquery. They can sometimes make it easier to read and maintain the code if it will be reused. This is especially so when you have multiple subqueries.

A CTE is formatted as: WITH [CTE Name] as ([query]). The CTE can then be referenced by name, but only within the statement it was created in. Any statement coming immediately before a CTE is declared must be terminated with a semicolon (;). You can have more than one CTE within a statement if they are separated by commas.

For instance, if we wanted details on employees that have worked in more than one department or worked more than one designated shift, we could use CTEs to handle the concatenation.

--Any previous query in the same batch must be terminated by a semicolon (;)
WITH ConcatDepartments
AS (SELECT
  EDH.BusinessEntityID,
  STUFF((SELECT
    ', ' + D.name
  FROM HumanResources.EmployeeDepartmentHistory EDH2
  JOIN HumanResources.Department D
    ON EDH2.DepartmentID = D.DepartmentID
  WHERE EDH2.BusinessEntityID = EDH.BusinessEntityID --correllate inner and outer
  FOR xml PATH ('')), 1, 1, '') AS Departments,
  COUNT(*) AS NumDepartments
FROM HumanResources.EmployeeDepartmentHistory EDH
GROUP BY EDH.BusinessEntityID), --The comma allows another CTE in the same statement

ConcatShifts
AS (SELECT
  EDH.BusinessEntityID,
  STUFF((SELECT
    ', ' + S.Name
  FROM HumanResources.EmployeeDepartmentHistory EDH2
  JOIN HumanResources.[Shift] S
    ON EDH2.DepartmentID = S.ShiftID
  WHERE EDH2.BusinessEntityID = EDH.BusinessEntityID --correllate inner and outer
  FOR xml PATH ('')), 1, 1, '') AS Shifts,
  COUNT(*) AS NumShifts
FROM HumanResources.EmployeeDepartmentHistory EDH
GROUP BY EDH.BusinessEntityID)

SELECT
  P.BusinessEntityID,
  P.FirstName,
  P.LastName,
  CD.Departments,
  CS.Shifts
FROM Person.Person P
INNER JOIN ConcatDepartments CD
  ON P.BusinessEntityID = CD.BusinessEntityID
INNER JOIN ConcatShifts CS
  ON p.BusinessEntityID = cs.BusinessEntityID
WHERE CD.NumDepartments > 1
OR CS.NumShifts > 1;

Recursive CTEs

A CTE can also reference itself, making it recursive. Among other things, this can be used for counting, though it may not be efficient or scale well, as Jeff Moden explained. For a straightforward example:

WITH CountingCTE
AS (SELECT
  1 AS num
UNION ALL
SELECT
  num + 1
FROM CountingCTE
WHERE num < 7)
SELECT
  *
FROM CountingCTE

In more sophisticated usage, recursive CTEs are useful for working with a parent-child hierarchy.

Next Steps

As Arshad Ali explains, one use for subqueries is to remove duplicate rows, especially if there is an identity column on the table.
Edwin Sarmiento explained how to use CTES for recursive queries.
Andy Brown has provided examples of using CTES to make code with subqueries easier to read.

About the author

Tim Wiseman is a SQL Server DBA and Python Developer in Las Vegas.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Monday, July 15, 2024 - 5:23:36 AM - Adam Sztamborski	Back To Top (92389)
There is a mistake in ConcatShifts CTE, you join ON EDH2.DepartmentID = S.ShiftID instead of ON EDH2.ShiftID = S.ShiftID hence incorrect results