SQL Server Statistical Window Functions PERCENTILE_DISC and PERCENTILE_CONT


So far in this tutorial we have seen different types of window functions: aggregate, ranking and analytical (lag and lead. In this part, weíll highlight the statistical functions, which are part of the analytical functions.


Both functions calculate a percentile based on the discrete and the continuous distribution correspondingly. The main difference between the two functions is that PERCENTILE_DISC will return a value from the data set while PERCENTILE_CONT will interpolate values.

The syntax for these functions is a bit different than other window functions:


The WITHIN GROUP clause specifies the numerical vales over which the percentile should be computed over. The percentile literal indicates the percentile value to calculate, which is a decimal number between 0.0 and 1.0. Letís calculate the 75-percentile of the online sales for each product category.

    ,[Sales - 75th perc cont] = PERCENTILE_CONT(0.75)
                                        WITHIN GROUP(ORDER BY [f].[SalesAmount])
                                        OVER(PARTITION BY [c].[EnglishProductCategoryName])
    ,[Sales - 75th perc disc] = PERCENTILE_DISC(0.75)
                                        WITHIN GROUP(ORDER BY [f].[SalesAmount])
                                        OVER(PARTITION BY [c].[EnglishProductCategoryName])
FROM [dbo].[FactInternetSales]      f
JOIN [dbo].[DimProduct]             p ON [p].[ProductKey]            = [f].[ProductKey]
JOIN [dbo].[DimProductSubcategory]  s ON [s].[ProductSubcategoryKey] = [p].[ProductSubcategoryKey]
JOIN [dbo].[DimProductCategory]     c ON [c].[ProductCategoryKey]    = [s].[ProductCategoryKey]; 

In this case, both the discrete and the continuous distribution yield the same results:

75th percentile

The 75th percentile of Bikes is 2384.07, which means that for 75% of sales the amount was lower than 2384.07, while for 25% is was higher. When specifying 0.5 as the percentile, you are calculating the median value. With these two new functions in your toolbox, you can easily calculate summary statistics for a data set. Letís do this for the employee salaries:

    ,MinRate    = MIN([BaseRate]) OVER (PARTITION BY [DepartmentName])
    ,[25thPerc] = PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [BaseRate]) OVER (PARTITION BY [DepartmentName])
    ,MedianRate = PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY [BaseRate]) OVER (PARTITION BY [DepartmentName])
    ,AvgRate    = AVG([BaseRate]) OVER (PARTITION BY [DepartmentName])
    ,[75thPerc] = PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY [BaseRate]) OVER (PARTITION BY [DepartmentName])
    ,MaxRate    = MAX([BaseRate]) OVER (PARTITION BY [DepartmentName])
FROM [dbo].[DimEmployee]

The results:

summary statistics

The CUME_DIST and PERCENT_RANK functions

The CUME_DIST function is closely to the PERCENTILE_DIST function: it returns the relative position of a value compared with other values in the data set. In other words, it returns the percentile of a specific value. The PERCENT_RANK function is similar to CUME_DIST.

Letís illustrate the concept using the values we found in the previous paragraph, by retrieving the percentiles for the different sales amount of the category accessories:

    ,CumeDist       = CUME_DIST()    OVER (ORDER BY f.[SalesAmount])
    ,PercentRank    = PERCENT_RANK() OVER (ORDER BY f.[SalesAmount])
FROM [dbo].[FactInternetSales]      f
JOIN [dbo].[DimProduct]             p ON [p].[ProductKey]            = [f].[ProductKey]
JOIN [dbo].[DimProductSubcategory]  s ON [s].[ProductSubcategoryKey] = [p].[ProductSubcategoryKey]
JOIN [dbo].[DimProductCategory]     c ON [c].[ProductCategoryKey]    = [s].[ProductCategoryKey]
WHERE [EnglishProductCategoryName] = 'Accessories';

The results are a bit surprising:

query results

The CUME_DIST function returns 92%, while in the previous section the value 34.99 was determined as the 75th percentile. Why this big of a difference? Remember, we are working with discrete functions here. The previous value, 32.60 has the 74th percentile, which is lower than 75 so the next value has to be taken. In this example, thatís 34.99 which happens to have a much higher CUME_DIST value. The PERCENT_RANK function on the other hand is much closer to 75%.

Additional Information

Comments For This Article

get free sql tips
agree to terms