Calculate the Statistical Mode in SQL Server using T-SQL

By:   |   Comments (8)   |   Related: > Functions System

Problem

Next to the average and the median, there is another statistical value that you can calculate over a set: the mode. This is the value that appears the most often in the set. The average and the median are straight forward to calculate with T-SQL in SQL Server, but how can we determine the mode using T-SQL?

Solution

The mode is a statistical calculation that returns the value in a set that occurs the most frequently. A good example is the most common name in a country. In this example, the set is all the possible names found in that country. In 2013, the mode for male baby names was Noah in the United States.

Let’s illustrate the difference between average, median and mode with a simple numerical example using the set {1, 2, 2, 3, 4, 7, 9}.

• Average = (1 + 2 + 2 + 3 + 4 + 7 + 9) / 7 = 4
• Median (the middle value ordered ascending) = 1, 2, 2, 3, 4, 7, 9 = 3
• Mode (most frequent value) = 1, 2, 2, 3, 4, 7, 9 = 2

Test Set-up

We will use the same test-up as in the tip Creating a box plot graph in SQL Server Reporting Services. There we analyzed the durations of the tickets of different customer service representatives. Now we will search the customer representative with the highest number of tickets for each month. In this case, the representative with the highest ticket count is the mode. For easy reference, here is the T-SQL script for creating the test data:

```CREATE TABLE dbo.DimRepresentative(
SK_Representative INT NOT NULL
,Name    VARCHAR(50) NOT NULL
CONSTRAINT [PK_DimRepresentative] PRIMARY KEY CLUSTERED
(
SK_Representative ASC
)
);
GO

INSERT INTO dbo.DimRepresentative(SK_Representative,Name)
VALUES  (1,'Bruce')
,(2,'Selena')
,(3,'Gordon')
,(4,'Harvey');
GO

CREATE TABLE dbo.FactCustomerService
(SK_CustomerServiceFact INT IDENTITY(1,1) NOT NULL
,SK_Representative  INT NULL
,TicketNumber   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() -- degenerate dimension
,TicketTimestamp  DATETIME2(3) NOT NULL
,Duration    INT NOT NULL);
GO

INSERT INTO dbo.FactCustomerService(SK_Representative,TicketTimestamp,Duration)
SELECT
SK_Representative = FLOOR(RAND()*4) + 1 -- random assign a duration to a customer representative
,Duration   = RAND()*100000;
GO 100
```

Since the script uses random values, it is possible that you see different outcomes on your machine than those displayed in this tip.

Calculating the Mode with T-SQL

The first step is to count the number of tickets for each representative per month. This is easily done with the COUNT function and a GROUP BY clause:

```SELECT
Representative = d.Name
,[Month]  = DATENAME(MONTH,[TicketTimestamp])
,cnt   = COUNT(1)
,MonthOrder  = MONTH([TicketTimestamp])
FROM [dbo].[FactCustomerService] f
JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
ORDER BY MonthOrder
```

You can for example see that Bruce got the highest count for January: 11 tickets. Now we have to select the highest count for each month. This can be done using the ROW_NUMBER function where we partition on the month. This function is a window function, which means it is calculated after the GROUP BY has been applied to the data. Therefore we can embed the COUNT inside the ROW_NUMBER function so we can sort on it:

```SELECT
Representative = d.Name
,[Month]  = DATENAME(MONTH,[TicketTimestamp])
,cnt   = COUNT(1)
,rid   = ROW_NUMBER() OVER (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)
,MonthOrder  = MONTH([TicketTimestamp])
FROM [dbo].[FactCustomerService] f
JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
ORDER BY MonthOrder
```

All that is left to do is to select the rows where the row number is equal to one.

```WITH CTE_CountPerMonth AS
(
SELECT
Representative = d.Name
,[Month]  = DATENAME(MONTH,[TicketTimestamp])
,cnt   = COUNT(1)
,rid   = ROW_NUMBER() OVER (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)
,MonthOrder  = MONTH([TicketTimestamp])
FROM [dbo].[FactCustomerService] f
JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
)
SELECT
Representative
,TicketCount = cnt
,[Month]
FROM CTE_CountPerMonth
WHERE rid = 1
ORDER BY MonthOrder;
```

Handling Ties

But what if there are two representatives holding the same high ticket count in a month? The ROW_NUMBER function doesn’t deal with ties. Right now, the representative is – in theory – randomly returned. To get around this, you can add a tie-breaker to the ROW_NUMBER function. For example, you could also sort on the representative itself. If two representatives with the same high score are found, the one that comes first in the alphabet is returned.

If you want to return actual ties, consider replacing the ROW_NUMBER function with the RANK function, since the RANK function is sensitive to ties.

Next Steps

Koen Verbeeck is a seasoned business intelligence consultant at AE. He has over a decade of experience with the Microsoft Data Platform in numerous industries. He holds several certifications and is a prolific writer contributing content about SSIS, ADF, SSAS, SSRS, MDS, Power BI, Snowflake and Azure services. He has spoken at PASS, SQLBits, dataMinds Connect and delivers webinars on MSSQLTips.com. Koen has been awarded the Microsoft MVP data platform award for many years.

View all my tips

 Tuesday, December 11, 2018 - 2:15:54 AM - Koen Verbeeck Back To Top (78448) Hi Gavin, you'd have to test it to be 100% sure which solution works best for your environment.But I'd think the temp table solution would be more resource intensive, since you have the extra step of writing to a table and then reading the data back in again. Koen

 Monday, December 10, 2018 - 9:47:24 AM - Gavin Back To Top (78442) would this be more or less resource intensive than creating a temp table with a count and then doing a max on the count?

 Tuesday, March 17, 2015 - 7:19:36 AM - Koen Verbeeck Back To Top (36557) Hi Thomas, thanks for the interesting alternative!

 Tuesday, March 17, 2015 - 5:20:17 AM - Thomas Franz Back To Top (36553) Another way to get the statistical mode value is the use of the FIRST_VALUE function. Following to the execution plan it has the same execution cost (but following to STATISTIC_IO it does some additional scans on a worktable so it could be slower for big tables): ` SELECT DISTINCT   Best_Representative = First_value(d.Name) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)  ,[Month]  = DATENAME(MONTH,[TicketTimestamp])  ,sells = First_value(COUNT(1)) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)    ,MonthOrder  = MONTH([TicketTimestamp]) FROM [dbo].[FactCustomerService] f JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp]) ORDER BY MONTH([TicketTimestamp])` If you remove the DISTINCT and add the Name and a count column you could create a table that shows the sells per month and person and compare it to the best one of this month (without unnecessary joins to the original tables again so it would be faster if you need this type of result f.e. for reports): `SELECT    [Month]  = DATENAME(MONTH,[TicketTimestamp])  ,d.name Representative  ,count(1) sells  ,Best_Representative = First_value(d.Name) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)  ,Best_sells = First_value(COUNT(1)) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)    ,MonthOrder  = MONTH([TicketTimestamp]) FROM [dbo].[FactCustomerService] f JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp]) ORDER BY MONTH([TicketTimestamp])`

 Monday, March 16, 2015 - 5:50:59 PM - Koen Verbeeck Back To Top (36550) The TOP clause is an interesting solution, but personally I'm not really fond of the WITH TIES option, since it direclty binds what you want to calculate with the mandotory ORDER BY clause. Also the TOP clause doesn't really work when you want to calculate the mode over different categories, such as the months in this example.

 Monday, March 16, 2015 - 3:53:12 PM - meh Back To Top (36546) A simple Employee table of Name and Age.   To get the Mode of the ages of the employees including ties used the following. SELECT TOP 1 WITH TIES     AGE     , COUNT(*) FROM Emp GROUP BY AGEORDER BY COUNT(*) DESC

 Monday, March 16, 2015 - 12:29:57 PM - Timothy A Wiseman Back To Top (36543) The mode can be an important thing to know at times, and it certainly helps you get a better idea about what is representative within a set of data.  Generally, I find it more convenient to pull the data into a more fully featured language to do statistical work with it beyond mere averages.  However, quite a bit can be done in pure T-SQL when that is desirable and this is an excellent example of how to do it that also shows off window functions.

 Monday, March 16, 2015 - 12:26:59 PM - Timothy A Wiseman Back To Top (36542) The Mode of a set can be at times important to know, and having a pure T-SQL Solution is handy as well as a good demonstration of the utility of window functions in T-SQL.  Frequently though I find it more useful to sift through the data in a language that is more accommodating to full statistical analysis rather than trying to find it through SQL Server.