Binning SQL Server Data to Reduce Column Cardinality

By:   |   Comments (4)   |   Related: > Analysis Services Development


Problem

I have a SQL Server data set based on age with values from 0 to 100. From this dataset I need to build a histogram showing the age distribution in 10 year increments. Ultimately, I need to use this data as input for a SQL Server Analysis Services neural network algorithm? How can I prepare the data to perform the data mining?

Solution

Reducing the number of distinct values in a column can assist in reducing the processing time of data mining models, make for a more generalized model, and help in the interpretation of the results. As shown in a previous tip, How to handle the MAXIMUM_STATES warning message in the SSAS Neural Network Data Mining Model, SQL Server Analysis Services has parameters for the cardinality of the columns used as input to the neural network.

Binning, also known as discretization, is a way of putting similar data values into bins. We will accomplish this binning in T-SQL by using a CASE expression. In this tip, I have a table that contains 101,706 records. This table has an age column with age values from 0 to 100. We will assign the age values to an age range value in increments of 10.

Age column statistics

There are 100 distinct values for the age column.

100 distinct values for the age column

We want to preserve the original age value and add a varchar column to store the new value.

ALTER TABLE ADD COLUMN

Next, we run an UPDATE statement with a CASE expression to assign the age values an age range bin value.

UPDATE statement with a CASE expression to assign the age values an age range bin value

Finally, we select the count of each age range value using a GROUP BY clause.

Counts per age range bin value
Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Dallas Snider Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Friday, March 23, 2018 - 2:35:03 PM - MsDee Back To Top (75511)

 [ is inclusive

( is not inclusive

 

 


Wednesday, March 2, 2016 - 9:39:45 AM - Darren Wheatley Back To Top (40840)

 also, I'm thinking the case statement must be part of the demonstration, since you could use

'[' + cast (age/10*10 as varchar) + '-' + cast (age/10*10+9 as varchar) + ']' as AgeRange

 


Wednesday, March 2, 2016 - 9:36:08 AM - Darren Wheatley Back To Top (40839)

I'm guessing this will have already been pointed out to you, but aren't your bins overlapping?

Also, I don't think you need to define an upper bound in each case statement as the function exits once it's matched a condition. 

 


Friday, February 5, 2016 - 8:33:48 AM - Michael Hansen Back To Top (40606)

I notice that you use [ as opening and ) as closing - is there a particular reason for this or is this just by chance?















get free sql tips
agree to terms