Binning SQL Server Data to Reduce Column Cardinality
I have a SQL Server data set based on age with values from 0 to 100. From this dataset I need to build a histogram showing the age distribution in 10 year increments. Ultimately, I need to use this data as input for a SQL Server Analysis Services neural network algorithm? How can I prepare the data to perform the data mining?
Reducing the number of distinct values in a column can assist in reducing the processing time of data mining models, make for a more generalized model, and help in the interpretation of the results. As shown in a previous tip, How to handle the MAXIMUM_STATES warning message in the SSAS Neural Network Data Mining Model, SQL Server Analysis Services has parameters for the cardinality of the columns used as input to the neural network.
Binning, also known as discretization, is a way of putting similar data values into bins. We will accomplish this binning in T-SQL by using a CASE expression. In this tip, I have a table that contains 101,706 records. This table has an age column with age values from 0 to 100. We will assign the age values to an age range value in increments of 10.
There are 100 distinct values for the age column.
We want to preserve the original age value and add a varchar column to store the new value.
Next, we run an UPDATE statement with a CASE expression to assign the age values an age range bin value.
Finally, we select the count of each age range value using a GROUP BY clause.
- We can use the output of the T-SQL above to create a histogram showing the age distribution.
- Check out these other tips on data mining in SQL Server Analysis Services.
- How to handle the MAXIMUM_STATES warning message in the SSAS Neural Network Data Mining Model
- SQL Server 2012 Analysis Services Association Rules Data Mining Example
- Explaining the Calculations of Probability and Importance for Complex Association Rules in SQL Server 2012 Analysis Services
- Classic Machine Learning Example In SQL Server Analysis Services
- Microsoft Naïve Bayes Data Mining Model in SQL Server Analysis Services
- Data Mining Clustering Example in SQL Server Analysis Services SSAS
- SQL Server Analysis Services Glossary
About the author
View all my tips