Learn more about SQL Server tools

 
 

Tutorials          DBA          Dev          BI          Career          Categories          Events          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Events

DBA    Dev    BI    Categories

 

Binning SQL Server Data to Reduce Column Cardinality


By:   |   Read Comments (3)   |   Related Tips: > Analysis Services Development

Problem

I have a SQL Server data set based on age with values from 0 to 100. From this dataset I need to build a histogram showing the age distribution in 10 year increments. Ultimately, I need to use this data as input for a SQL Server Analysis Services neural network algorithm? How can I prepare the data to perform the data mining?

Solution

Reducing the number of distinct values in a column can assist in reducing the processing time of data mining models, make for a more generalized model, and help in the interpretation of the results. As shown in a previous tip, How to handle the MAXIMUM_STATES warning message in the SSAS Neural Network Data Mining Model, SQL Server Analysis Services has parameters for the cardinality of the columns used as input to the neural network.

Binning, also known as discretization, is a way of putting similar data values into bins. We will accomplish this binning in T-SQL by using a CASE expression. In this tip, I have a table that contains 101,706 records. This table has an age column with age values from 0 to 100. We will assign the age values to an age range value in increments of 10.

Age column statistics

There are 100 distinct values for the age column.

100 distinct values for the age column

We want to preserve the original age value and add a varchar column to store the new value.

ALTER TABLE ADD COLUMN

Next, we run an UPDATE statement with a CASE expression to assign the age values an age range bin value.

UPDATE statement with a CASE expression to assign the age values an age range bin value

Finally, we select the count of each age range value using a GROUP BY clause.

Counts per age range bin value
Next Steps


Last Update:





About the author
MSSQLTips author Dallas Snider Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.

View all my tips





More SQL Server Solutions




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


Get free SQL tips:

*Enter Code refresh code     



Wednesday, March 02, 2016 - 9:39:45 AM - Darren Wheatley Back To Top

 also, I'm thinking the case statement must be part of the demonstration, since you could use

'[' + cast (age/10*10 as varchar) + '-' + cast (age/10*10+9 as varchar) + ']' as AgeRange

 


Wednesday, March 02, 2016 - 9:36:08 AM - Darren Wheatley Back To Top

I'm guessing this will have already been pointed out to you, but aren't your bins overlapping?

Also, I don't think you need to define an upper bound in each case statement as the function exits once it's matched a condition. 

 


Friday, February 05, 2016 - 8:33:48 AM - Michael Hansen Back To Top

I notice that you use [ as opening and ) as closing - is there a particular reason for this or is this just by chance?


Learn more about SQL Server tools