Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Calculate the Statistical Mode in SQL Server using T-SQL


By:   |   Read Comments (6)   |   Related Tips: More > Functions - System

Attend these FREE MSSQLTips webcasts >> click to register


Problem

Next to the average and the median, there is another statistical value that you can calculate over a set: the mode. This is the value that appears the most often in the set. The average and the median are straight forward to calculate with T-SQL in SQL Server, but how can we determine the mode using T-SQL?

Solution

The mode is a statistical calculation that returns the value in a set that occurs the most frequently. A good example is the most common name in a country. In this example, the set is all the possible names found in that country. In 2013, the mode for male baby names was Noah in the United States.

Let’s illustrate the difference between average, median and mode with a simple numerical example using the set {1, 2, 2, 3, 4, 7, 9}.

  • Average = (1 + 2 + 2 + 3 + 4 + 7 + 9) / 7 = 4
  • Median (the middle value ordered ascending) = 1, 2, 2, 3, 4, 7, 9 = 3
  • Mode (most frequent value) = 1, 2, 2, 3, 4, 7, 9 = 2

Test Set-up

We will use the same test-up as in the tip Creating a box plot graph in SQL Server Reporting Services. There we analyzed the durations of the tickets of different customer service representatives. Now we will search the customer representative with the highest number of tickets for each month. In this case, the representative with the highest ticket count is the mode. For easy reference, here is the T-SQL script for creating the test data:

CREATE TABLE dbo.DimRepresentative(
  SK_Representative INT NOT NULL
 ,Name    VARCHAR(50) NOT NULL
 CONSTRAINT [PK_DimRepresentative] PRIMARY KEY CLUSTERED 
 (
  SK_Representative ASC
 )
);
GO

INSERT INTO dbo.DimRepresentative(SK_Representative,Name)
VALUES  (1,'Bruce')
  ,(2,'Selena')
  ,(3,'Gordon')
  ,(4,'Harvey');
GO

CREATE TABLE dbo.FactCustomerService
 (SK_CustomerServiceFact INT IDENTITY(1,1) NOT NULL
 ,SK_Representative  INT NULL
 ,TicketNumber   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() -- degenerate dimension
 ,TicketTimestamp  DATETIME2(3) NOT NULL
 ,Duration    INT NOT NULL);
GO

INSERT INTO dbo.FactCustomerService(SK_Representative,TicketTimestamp,Duration)
SELECT
  SK_Representative = FLOOR(RAND()*4) + 1 -- random assign a duration to a customer representative
 ,TicketTimestamp = DATEADD(DAY,RAND()*100,'2014-01-01')
 ,Duration   = RAND()*100000;
GO 100

Since the script uses random values, it is possible that you see different outcomes on your machine than those displayed in this tip.

Calculating the Mode with T-SQL

The first step is to count the number of tickets for each representative per month. This is easily done with the COUNT function and a GROUP BY clause:

SELECT
  Representative = d.Name
 ,[Month]  = DATENAME(MONTH,[TicketTimestamp])
 ,cnt   = COUNT(1)
 ,MonthOrder  = MONTH([TicketTimestamp])
FROM [dbo].[FactCustomerService] f
JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
ORDER BY MonthOrder

Counting the tickets per representative per month

You can for example see that Bruce got the highest count for January: 11 tickets. Now we have to select the highest count for each month. This can be done using the ROW_NUMBER function where we partition on the month. This function is a window function, which means it is calculated after the GROUP BY has been applied to the data. Therefore we can embed the COUNT inside the ROW_NUMBER function so we can sort on it:

SELECT
  Representative = d.Name
 ,[Month]  = DATENAME(MONTH,[TicketTimestamp])
 ,cnt   = COUNT(1)
 ,rid   = ROW_NUMBER() OVER (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)
 ,MonthOrder  = MONTH([TicketTimestamp])
FROM [dbo].[FactCustomerService] f
JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
ORDER BY MonthOrder

Adding the row number

All that is left to do is to select the rows where the row number is equal to one.

WITH CTE_CountPerMonth AS
(
 SELECT
   Representative = d.Name
  ,[Month]  = DATENAME(MONTH,[TicketTimestamp])
  ,cnt   = COUNT(1)
  ,rid   = ROW_NUMBER() OVER (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)
  ,MonthOrder  = MONTH([TicketTimestamp])
 FROM [dbo].[FactCustomerService] f
 JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
 GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
)
SELECT
  Representative
 ,TicketCount = cnt
 ,[Month]
FROM CTE_CountPerMonth
WHERE rid = 1
ORDER BY MonthOrder;

The final results

Handling Ties

But what if there are two representatives holding the same high ticket count in a month? The ROW_NUMBER function doesn’t deal with ties. Right now, the representative is – in theory – randomly returned. To get around this, you can add a tie-breaker to the ROW_NUMBER function. For example, you could also sort on the representative itself. If two representatives with the same high score are found, the one that comes first in the alphabet is returned.

If you want to return actual ties, consider replacing the ROW_NUMBER function with the RANK function, since the RANK function is sensitive to ties.

Next Steps


Last Update:


signup button

next tip button



About the author
MSSQLTips author Koen Verbeeck Koen Verbeeck is a BI professional, specializing in the Microsoft BI stack with a particular love for SSIS.

View all my tips





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Tuesday, March 17, 2015 - 7:19:36 AM - Koen Verbeeck Back To Top

Hi Thomas,

thanks for the interesting alternative!


Tuesday, March 17, 2015 - 5:20:17 AM - Thomas Franz Back To Top

Another way to get the statistical mode value is the use of the FIRST_VALUE function. Following to the execution plan it has the same execution cost (but following to STATISTIC_IO it does some additional scans on a worktable so it could be slower for big tables):

 SELECT DISTINCT
   Best_Representative = First_value(d.Name) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)
  ,[Month]  = DATENAME(MONTH,[TicketTimestamp])
  ,sells = First_value(COUNT(1)) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC) 
  ,MonthOrder  = MONTH([TicketTimestamp])
 FROM [dbo].[FactCustomerService] f
 JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
 GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
 ORDER BY MONTH([TicketTimestamp])

If you remove the DISTINCT and add the Name and a count column you could create a table that shows the sells per month and person and compare it to the best one of this month (without unnecessary joins to the original tables again so it would be faster if you need this type of result f.e. for reports):

SELECT 
   [Month]  = DATENAME(MONTH,[TicketTimestamp])
  ,d.name Representative
  ,count(1) sells
  ,Best_Representative = First_value(d.Name) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC)
  ,Best_sells = First_value(COUNT(1)) over (PARTITION BY DATENAME(MONTH,[TicketTimestamp]) ORDER BY COUNT(1) DESC) 
  ,MonthOrder  = MONTH([TicketTimestamp])
 FROM [dbo].[FactCustomerService] f
 JOIN [dbo].[DimRepresentative]  d ON f.SK_Representative = d.SK_Representative
 GROUP BY Name, DATENAME(MONTH,[TicketTimestamp]),MONTH([TicketTimestamp])
 ORDER BY MONTH([TicketTimestamp])

 


Monday, March 16, 2015 - 5:50:59 PM - Koen Verbeeck Back To Top

The TOP clause is an interesting solution, but personally I'm not really fond of the WITH TIES option, since it direclty binds what you want to calculate with the mandotory ORDER BY clause. Also the TOP clause doesn't really work when you want to calculate the mode over different categories, such as the months in this example.


Monday, March 16, 2015 - 3:53:12 PM - meh Back To Top

A simple Employee table of Name and Age.

 

To get the Mode of the ages of the employees including ties used the following.

SELECT TOP 1 WITH TIES
     AGE
     , COUNT(*)
FROM Emp
GROUP BY AGE
ORDER BY COUNT(*) DESC


Monday, March 16, 2015 - 12:29:57 PM - Timothy A Wiseman Back To Top

The mode can be an important thing to know at times, and it certainly helps you get a better idea about what is representative within a set of data.  Generally, I find it more convenient to pull the data into a more fully featured language to do statistical work with it beyond mere averages.  However, quite a bit can be done in pure T-SQL when that is desirable and this is an excellent example of how to do it that also shows off window functions.


Monday, March 16, 2015 - 12:26:59 PM - Timothy A Wiseman Back To Top

The Mode of a set can be at times important to know, and having a pure T-SQL Solution is handy as well as a good demonstration of the utility of window functions in T-SQL.  Frequently though I find it more useful to sift through the data in a language that is more accommodating to full statistical analysis rather than trying to find it through SQL Server.


Learn more about SQL Server tools