We're currently analyzing a data set and using median to find the central point of the data, but we think that based on some extreme values, this may not actually be the central point. We think that median is useful, but it may be limited for our data set; is there any way that we can limit the extreme ranges for data points and find the central point after that?
Median offers a robust measure in statistics when needing to find a precise measuring point or range for a data set, but it may not be helpful if there are extreme high points and low points and one is trying to find a central tendency. In this example, we'll look at historic natural gas future prices since it has historically had short and extreme price increases and one way to look at these data is to filter these out from the median. We'll be using the MAD - or median absolute deviation - as an additional filter to make sure that we calculate an accurate median price after removing these extreme prices. For a related tip using average, we previously looked at removing outliers from a data set (removing them on the basis of average).
If using the latest edition of SQL Server (2016) with R, you can use R to calculate the MAD, since the function is built in (the same is true with Oracle). Additionally, we'll make the same assumptions as the linked tip covering R about a normally distributed data set. This does not mean that you would want to use this with natural gas futures, but that we'll be using this assumption only for the sake of this tip.
First, we need to know the median and there's a great tip covering calculating median. In this tip, I will use a common table expression because I prefer them for readability when troubleshooting or analyzing a data set; however, we can build a function or stored procedure like the other tips mention for re-use. The median:
;WITH GetMedian AS( SELECT ROW_NUMBER() OVER (ORDER BY NaturalGasPrice DESC) AS MdnCnt , CASE WHEN ((SELECT COUNT(*) FROM tblNaturalGas) % 2 = 0) THEN ((SELECT COUNT(*) FROM tblNaturalGas)/2) ELSE (((SELECT COUNT(*) FROM tblNaturalGas)/2)+1) END AS Median , NaturalGasPrice FROM tblNaturalGas ) SELECT NaturalGasPrice FROM GetMedian WHERE MdnCnt = Median ---- Result: 3.330000
If I stuck with the result here, I might be tempted to think that the central price for natural gas futures is $3.33; but what if I wanted to exclude those extreme values and then calculate the median? The steps to calculating the MAD are:
- Calculate the median value of the data set.
- Subtract the median from each value in the data set, returning the absolute value of these subtracted values.
- Get the new median from the absolute values of the ordered result.
- Multiply the median by the assumed constant relative to the distribution of our data set. In this example, we will use the normal distribution assumption of 1.4826.
- Like outliers, I use the assumption here of 3 median absolute deviations and since no price is negative, I exclusively multiply the result in step four by positive three.
- Finally, using the value in step five, filter out all values in the full data set above that threshold (since we're only using the positive value) and calculate the median without those values included in the data set.
Let's look at this step-by-step in T-SQL:
---- Variable table used for median of second result set after subtracting each value from overall median DECLARE @mad TABLE( NaturalGasPrice_MAD DECIMAL(22,6) ) -- STEP ONE ---- Variables for the median price overall, the median of second result set, and the median absolute deviation filter point DECLARE @medianprice DECIMAL(22,6), @medianmad DECIMAL(22,6), @medianmadfilter DECIMAL(22,6) ---- Get the median price overall ;WITH GetMedian AS( SELECT ROW_NUMBER() OVER (ORDER BY NaturalGasPrice DESC) AS MdnCnt , CASE WHEN ((SELECT COUNT(*) FROM tblNaturalGas) % 2 = 0) THEN ((SELECT COUNT(*) FROM tblNaturalGas)/2) ELSE (((SELECT COUNT(*) FROM tblNaturalGas)/2)+1) END AS Median , NaturalGasPrice FROM tblNaturalGas ) SELECT @medianprice = NaturalGasPrice FROM GetMedian WHERE MdnCnt = Median ---- Output that value SELECT @medianprice AS Median_NaturalGasPrice_NoMADFilter -- STEP TWO ---- Subtract the median price from each value and take the absolute value of each of these points INSERT INTO @mad SELECT ABS(NaturalGasPrice - @medianprice) FROM tblNaturalGas -- STEP THREE ---- Get the median of the absolute value from the overall median price subtracted from each data point ;WITH GetMedianForMad AS( SELECT ROW_NUMBER() OVER (ORDER BY NaturalGasPrice_MAD DESC) AS MdnCnt , CASE WHEN ((SELECT COUNT(*) FROM @mad) % 2 = 0) THEN ((SELECT COUNT(*) FROM @mad)/2) ELSE (((SELECT COUNT(*) FROM @mad)/2)+1) END AS Median , NaturalGasPrice_MAD FROM @mad ) SELECT @medianmad = NaturalGasPrice_MAD FROM GetMedianForMad WHERE MdnCnt = Median ---- Output what we have so far: SELECT @medianmad AS Median_MAD , (@medianmad*1.4826) AS Median_MAD_NormalDistribution ---- Assuming normal distribution; b = 1.4826 , ((@medianmad*1.4826)*3) AS Median_MAD_UpperRange ---- Like outliers, I generally use 3 here; however, 2.7 may be more appropriate for natural gas futures -- STEP FOUR and FIVE ---- Save the median absolute deviation filter point SELECT @medianmadfilter = ((@medianmad*1.4826)*3) -- STEP SIX ---- What is the median when we filter out the extreme values based on the MAD? ;WITH GetMedian_FilterMAD AS( SELECT ROW_NUMBER() OVER (ORDER BY NaturalGasPrice DESC) AS MdnCnt , CASE WHEN ((SELECT COUNT(*) FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter) % 2 = 0) THEN ((SELECT COUNT(*) FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter)/2) ELSE (((SELECT COUNT(*) FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter)/2)+1) END AS Median , NaturalGasPrice FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter ) SELECT NaturalGasPrice AS MedianNaturalGasPrice_FilteredByMAD FROM GetMedian_FilterMAD WHERE MdnCnt = Median
Result from the above code:
If I assume that $3.33 doesn't tell me about the price of natural gas futures based on the data set that I have and I'm aiming to find the most central price, I can look at using the above to calculate the absolute deviation surrounding the median. The main key is what I believe about whether my data set has extreme values, or whether I think it's a good distribution. In most cases, I will calculate the median of both and compare, then look at the data set and re-evaluate the extreme values.
- Consider an example of median household income in the United States, which is currently $53,600 a year. If we organized all the data without the extremely expensive cities, how different or similar would the medians be?
- Remember, that median is a precise measure already, so when we consider additional filtering we should be careful about over-analysis.
Last Updated: 2016-09-01
About the author
View all my tips