By: John Miner
Problem
On any given day, a typical business user might want to know a distinct list of values in a particular column. For instance, give me a list of unique job titles in the Adventure Works company. If the number of job titles is extremely large compared to the number of employees, a restructuring of titles by the Human Resources department might be in order. Additionally, we might want a distinct count of job titles versus a total count of employees. The first example uses the DISTINCT clause with the SELECT statement and the second example demonstrates the use of the DISTINCT clause within an AGGREGATE FUNCTION.
In this brief SQL tutorial, we will review both use cases of the DISTINCT clause in detail.
Solution
Today, we are going to re-use the IaaS and PaaS databases that we set up in a prior tip as our lab environment. Before we craft queries against the Adventure Works sample database, I will show you how to create a simple dataset (table) in TEMPDB that can be used to test sample queries. One appealing fact about a simple dataset is the ability to see the results of a query given a limited number of rows. I have used similar tables in the past when posting answers to questions on Stack Overflow. The complete set of Transact SQL examples is enclosed at the end of the article.
Pets Sample Dataset
I truly believe that every pet deserves a loving home. That is why we rescued both a Female Terrier mix (dog) named Ella and a Male Siberian (cat) named Ziggy from local animal shelters. Our tiny example PETS table will have 6 records in which the data reflect the images seen below. The images were taken from the Oxford Pets Dataset, which can be used by Machine Learning Algorithms to classify images as either cat or dog. In short, I just want to make my PETS table more interesting with pictures!
The T-SQL syntax uses a table value constructor to create a derived table. The selected data is inserted into a local temporary table called #PETS that stays in scope until the query window (session) is closed. The following query can be re-executed anytime since the table is dropped and recreated each time.
-- Drop existing table
DROP TABLE IF EXISTS #PETS;

-- Create very simple table
SELECT PETS.* INTO #PETS
FROM
(
  VALUES
  ('CAT','SAMMY','ABYSSINIAN','MALE', 2.5),
  ('DOG','HENRY','AMERICAN BULLDOG','MALE', 1.3),
  ('CAT','LUCY','BOMBAY','FEMALE', 4.0),
  ('DOG','CLEO', 'AMERICAN PITBULL','FEMALE', 6.9),
  ('CAT','BELLA','PERSIAN','FEMALE', 8.7),
  ('DOG','ZAC','BASSET HOUND','MALE', 4.0)
) AS PETS
(
  PET_TYPE,
  PET_NAME,
  PET_BREED,
  PET_GENDER,
  PET_AGE
);
GO

-- Show the data
SELECT * FROM #PETS;
GO
The image below shows the output from the SELECT statement. Now that we have a simple test table, we can start talking about the SQL SELECT DISTINCT clause.
Distinct with Column Combinations
In its simplest form, the DISTINCT clause returns a distinct list of values from one or more columns. Below, we can see two statements that logically return the same set of records. The output answers "What type of pets are stored in the table?". We have both CATS and DOGS. I wonder if the execution plans are the same for both queries?
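A minimal sketch of what the two single-column statements might look like, assuming the #PETS table created earlier. One uses GROUP BY and the other uses DISTINCT. Both should return one row per pet type.

```sql
-- Group By example: one row per pet type
SELECT PET_TYPE
FROM #PETS
GROUP BY PET_TYPE;
GO

-- Distinct example: same logical result
SELECT DISTINCT PET_TYPE
FROM #PETS;
GO
```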
Adding more columns to the query tells the query engine to find all the distinct combinations of PET TYPES and PET GENDERS. Again, we can use either a SQL GROUP BY or DISTINCT clause to return the desired results. Please see the T-SQL code below that is used in this example. In our case, we have both MALE and FEMALE CATS and DOGS. Please see the image below for the output.
-- Group By Example
SELECT PET_TYPE, PET_GENDER
FROM #PETS
GROUP BY PET_TYPE, PET_GENDER;
GO

-- Distinct Example
SELECT DISTINCT PET_TYPE, PET_GENDER
FROM #PETS;
GO
When looking at two different T-SQL statements that return the same result set, a developer should look at the estimated and/or actual execution plans. The Microsoft documentation (formerly MSDN) explains how to display them if you do not know how. One might be surprised that the two queries execute the same way. Since the temporary table has no primary key, it is considered a heap. By default, a full table scan will be used to read the data. A sort distinct operator takes the 6 rows as input and comes up with 4 unique combinations as output.
One more simple example on distinct columns. If we add PET BREED to the list of columns, then all six records are returned. Why is this result happening? The fact is that each record has a unique breed. Therefore, we have six unique records.
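As a sketch of the three-column case, again assuming the #PETS table from earlier, the query might look like the following. Because every PET_BREED value in the sample data is different, no rows are collapsed.

```sql
-- Distinct over three columns: each breed is unique,
-- so every row in #PETS is its own combination
SELECT DISTINCT PET_TYPE, PET_GENDER, PET_BREED
FROM #PETS;
GO
```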
To recap, both the GROUP BY and DISTINCT clauses come up with the same results when trying to select combinations of values from 1 to N columns.
Logical Processing
I would like to answer the question "How does a SELECT statement get evaluated?". However, we first need a little background in computer language design to better understand the Transact Structured Query Language (T-SQL).
The Transact SQL language is just like any other computer language that humans have built. There are two early phases to processing a program written in a given language. The first phase is lexical analysis, which breaks the source text into tokens. The second phase is parsing, which checks that the tokens form syntactically correct statements and hands the result to either a compiler or an interpreter. For instance, a C compiler translates a given program into assembly (object code). It does not actually execute the program. A BASIC interpreter, on the other hand, executes the program line by line until the end of the program is found. Both C and BASIC are general purpose languages. I did a lot of programming in these languages in the 1990s.
The final answer is the fact that Transact SQL is a domain specific language which is interpreted by the database engine. With any language, the designer has to come up with a set of rules. One set of rules determines how the SELECT statement is logically processed. Of course, the optimizer that creates the execution plan might actually change the order if necessary. The image below is taken from the MSDN webpage.
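To make the ordering concrete, here is a small annotated sketch against the #PETS table. The step numbers in the comments follow the logical processing order documented by Microsoft for the SELECT statement. The query itself, including the filter values, is only an illustration.

```sql
-- Comments show the logical processing order, not the written order
SELECT DISTINCT              -- 8. SELECT, then 9. DISTINCT
    PET_TYPE,
    COUNT(*) AS TOTAL
FROM #PETS                   -- 1. FROM (2. ON and 3. JOIN when tables are joined)
WHERE PET_AGE > 2.0          -- 4. WHERE
GROUP BY PET_TYPE            -- 5. GROUP BY (6. WITH CUBE or WITH ROLLUP)
HAVING COUNT(*) > 1          -- 7. HAVING
ORDER BY TOTAL DESC;         -- 10. ORDER BY (11. TOP)
GO
```

This ordering explains why the alias TOTAL can be used in the ORDER BY clause but not in the WHERE clause: the WHERE clause is evaluated before the SELECT list defines the alias.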
To recap, a solid understanding of the execution order of a SELECT statement is valuable when trying to optimize query plans. Today, we have seen both SQL statements execute in the same amount of time. There will be times in which two statements will return the same results but with widely different execution times.
SELECT DISTINCT Syntax with Aggregate Functions
The simple PETS dataset was manually created by this author. However, there are animal shelters that might have similar data. They might want to know the average age of a given pet. Does the following T-SQL generate the same results?
-- Regular Average Example
SELECT AVG(PET_AGE) AS REG_AVG
FROM #PETS
GO

-- Distinct Example
SELECT AVG(DISTINCT PET_AGE) AS DST_AVG
FROM #PETS
GO
The image below shows two different averages. Therefore, the results are not the same. Both statements throw out any NULL values as defined by the AVG function. However, the second call (distinct example) will also throw out duplicate values in the PET_AGE column. We have two animals that are exactly 4 years old, so the regular average divides the sum of six ages by 6 (roughly 4.57), while the distinct average divides the sum of five ages by 5 (4.68). This removal causes the two statements to come up with different results.
What happens if we ask for the distinct average by pet type? Since the duplicate age was shared between one cat record and one dog record, the duplicate disappears once the data is grouped by pet type. There are no longer any duplicate ages within the cat records or within the dog records.
If we look at the actual query plan, we will see that the same set of operators is used. In general, any type of sort is an expensive operation.
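A sketch of such a query, assuming the #PETS table from earlier, puts both averages side by side. The two average columns should match for each pet type, since neither the cats nor the dogs share an age within their own group.

```sql
-- Regular versus distinct average, grouped by pet type
SELECT
    PET_TYPE,
    AVG(PET_AGE) AS REG_AVG,
    AVG(DISTINCT PET_AGE) AS DST_AVG
FROM #PETS
GROUP BY PET_TYPE;
GO
```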
One last example before we play with T-SQL statements against the Adventure Works database. Setting the PET_AGE to either NULL or 4.0 will cause the COUNT function to show different totals. Does the query plan differ between these two statements?
Not only are the actual execution plans different, but the first statement also executes 4 times faster than the second statement. Again, any type of sort is an expensive operation. The first statement just counts the number of records. The second statement sorts the data by PET_AGE. Then, it counts the number of records while tossing any duplicate or NULL values.
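The sketch below shows one way to try this out. The #PET_AGES table, its column types, and the choice of which pet gets a NULL age are all assumptions made for illustration. The table is created explicitly so that PET_AGE is guaranteed to allow NULL values.

```sql
-- Hypothetical helper table with a nullable age column
DROP TABLE IF EXISTS #PET_AGES;
CREATE TABLE #PET_AGES
(
    PET_NAME VARCHAR(16) NOT NULL,
    PET_AGE  DECIMAL(3,1) NULL
);

INSERT INTO #PET_AGES (PET_NAME, PET_AGE)
VALUES
    ('SAMMY', 2.5),
    ('HENRY', 1.3),
    ('LUCY',  4.0),
    ('CLEO',  NULL),   -- unknown age
    ('BELLA', 8.7),
    ('ZAC',   4.0);    -- duplicate of LUCY's age
GO

-- Counts every row, including the NULL and the duplicate: 6
SELECT COUNT(*) AS REG_CNT
FROM #PET_AGES;
GO

-- Ignores the NULL and counts 4.0 once: 4
SELECT COUNT(DISTINCT PET_AGE) AS DST_CNT
FROM #PET_AGES;
GO
```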
Reducing Job Titles
Let us now revisit the idea that the Adventure Works company sample database has too many job titles for the number of employees. I will be using the virtual machine (IaaS) which has SQL Server 2019 and the most recent version of SQL Server Management Studio (SSMS). The image below shows my remote desktop connection to this virtual machine.
The first query a developer might try is to get a count of job titles as seen below. However, this statement does not show the breakdown of employees by job title.
Our second attempt to write the query might be a distinct selection of job titles. The results of the two queries show us that the 290 employees have 67 job titles. That is a lot of job titles. But it does not show the job titles that are seldom used. These are probably the ones that HR might discontinue.
The best approach is to count the number of employees by job title. The two statements listed below return the same results and have the same execution plan.
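A minimal sketch of what these two attempts might look like against the HumanResources.Employee table. The exact statements are an assumption based on the description above.

```sql
-- First attempt: count of job title values, one per employee row
SELECT COUNT(JobTitle) AS Total
FROM HumanResources.Employee;
GO

-- Second attempt: distinct list of job titles
SELECT DISTINCT JobTitle
FROM HumanResources.Employee;
GO
```

If only the two totals are needed, COUNT(*) and COUNT(DISTINCT JobTitle) could also be placed in a single SELECT list.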
-- Group By Example
SELECT JobTitle, COUNT(*) AS Total
FROM HumanResources.Employee
GROUP BY JobTitle
GO

-- Distinct Example
SELECT JobTitle, COUNT(DISTINCT BusinessEntityID) AS Total
FROM HumanResources.Employee
GROUP BY JobTitle
GO
The output of the first query is below. I added a SQL ORDER BY clause which uses an additional sort operator. It might be a wiser choice to sort by the Total column ascending. This would list the seldom used titles first.
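As a sketch, sorting by the Total column in ascending order could be written as follows. The secondary sort on JobTitle is an addition of mine to keep ties in a stable order.

```sql
-- Seldom used job titles appear first
SELECT JobTitle, COUNT(*) AS Total
FROM HumanResources.Employee
GROUP BY JobTitle
ORDER BY Total ASC, JobTitle ASC;
GO
```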
In a nutshell, we can supply the Human Resources Department with a list of Job Titles by number of employees who have that title.
Identifying Foreign Transactions
The Finance Department at Adventure Works wants to know how many international orders were processed in 2011. There is a cost associated with using a currency transaction service for on-line orders. The company is considering eliminating certain countries that do not have a high number of sales. Therefore, we want to know the number of sales by foreign currency.
The T-SQL snippet below has a SQL JOIN from the SalesOrderHeader table to the CurrencyRate table. We are interested in the year of the order, the currency used in the purchase and the count of orders.
-- Group By Example
SELECT
    YEAR(H.OrderDate) AS OrderYear,
    R.ToCurrencyCode,
    COUNT(*) AS Total
FROM
    Sales.SalesOrderHeader AS H
LEFT JOIN
    Sales.CurrencyRate AS R
ON
    H.CurrencyRateID = R.CurrencyRateID
WHERE
    YEAR(H.OrderDate) = 2011
GROUP BY
    YEAR(H.OrderDate),
    R.ToCurrencyCode
GO
The image below shows the output of the GROUP BY query that solves this problem.
The following table translates the currency codes to countries. We can see that France and Denmark are candidates as countries to not sell to.
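What about the DISTINCT clause? The T-SQL snippet below answers the following question "How many distinct foreign currencies are used by orders in the year of 2011?". Orders paid in United States Dollars have no matching currency rate, so the LEFT JOIN leaves their currency code as NULL. Since NULL values are dropped by the distinct count, the United States Dollar is correctly removed from the results.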
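A sketch of such a query, reusing the same join and filter as the GROUP BY example, might look like this. The column alias is hypothetical.

```sql
-- Distinct Example: number of foreign currencies used in 2011
SELECT
    YEAR(H.OrderDate) AS OrderYear,
    COUNT(DISTINCT R.ToCurrencyCode) AS TotalCurrencies
FROM
    Sales.SalesOrderHeader AS H
LEFT JOIN
    Sales.CurrencyRate AS R
ON
    H.CurrencyRateID = R.CurrencyRateID
WHERE
    YEAR(H.OrderDate) = 2011
GROUP BY
    YEAR(H.OrderDate);
GO
```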
Use DISTINCT and GROUP BY in the correct queries to obtain the required results.
Summary
The DISTINCT clause can be used with a list of columns to display all unique combinations of values stored in the table. In our PETS table, we noticed that we had two pet types and two pet genders. Therefore, four unique records were generated in the output window. In this case every combination occurs in the data, so the result matches the cross product of the number of distinct values in each column, 2 x 2. In general, only the combinations that actually occur in the table are returned. When used with columns, the DISTINCT clause counts a NULL value as a valid combination. If you are interested, I enclosed an example for the Shipping Department of the Adventure Works Company that finds DISTINCT shipping addresses by CITY, STATE and COUNTRY. This might be helpful when brokering deals with shipping companies to remote locations.
The use of a table value constructor for a derived table is a great way to create a small, simple dataset to test your queries on. I use this technique many times when teaching. Since the tiny table is in TEMPDB, there is no need to even create a database to explore the syntax and results of a query.
I did not go into detail about the logical execution order of the SELECT statement. However, it is a good reference when you are trying to optimize a query. Make sure to always review the actual execution plan. You might find two Transact SQL statements that produce the same results but execute with widely different runtimes.
Last but not least, the DISTINCT clause can be used with aggregate functions. Some functions like MIN() and MAX() will return the same results regardless of whether duplicates occur in the data. Other functions such as COUNT(), SUM() and AVG() are affected by the removal of duplicate entries, and all of them toss away NULL values. I included an example in the final T-SQL script that compares the column Total Due when each aggregate function is called normally versus when the function is called with the DISTINCT clause.
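A sketch of such a comparison against the TotalDue column of Sales.SalesOrderHeader is shown below. The column aliases are hypothetical and the author's enclosed script may be structured differently.

```sql
-- Each aggregate called normally and with DISTINCT
SELECT
    MIN(TotalDue)            AS MinReg,
    MIN(DISTINCT TotalDue)   AS MinDst,   -- same as MinReg
    MAX(TotalDue)            AS MaxReg,
    MAX(DISTINCT TotalDue)   AS MaxDst,   -- same as MaxReg
    COUNT(TotalDue)          AS CntReg,
    COUNT(DISTINCT TotalDue) AS CntDst,   -- differs if totals repeat
    SUM(TotalDue)            AS SumReg,
    SUM(DISTINCT TotalDue)   AS SumDst,   -- differs if totals repeat
    AVG(TotalDue)            AS AvgReg,
    AVG(DISTINCT TotalDue)   AS AvgDst    -- differs if totals repeat
FROM Sales.SalesOrderHeader;
GO
```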
Enclosed is the full T-SQL script used during this tutorial. Next time, I will be talking about how to LEFT JOIN tables in a SELECT statement.
Next Steps
Check out these related articles
- Problem using DISTINCT in case insensitive SQL Server databases
- SQL Server 2019 APPROX_COUNT_DISTINCT Function
- SQL UNION vs. UNION ALL
About the author
This author pledges the content of this article is based on professional experience and not AI generated.