Data Mining Process Overview
By: Siddharth Mehta
Data Mining can be applied for a variety of purposes. Before one starts considering data mining as a probable solution, one should clearly understand the typical applications of data mining as well as the approach to develop data mining models in an enterprise. Having understood the fundamental considerations behind data mining, one can validate whether data mining would be the right solution for the problem. Also, one can assess the time, effort, infrastructure, and other resources that would be required to develop data mining models. Below are some of the basic concepts related to data mining.
Below are the typical applications of data mining.
Applications of Data Mining
- Forecasting: Estimating sales, predicting server loads or server downtime
- Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
- Recommendations: Determining which products are likely to be sold together, generating recommendations
- Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
- Grouping: Separating customers or events into cluster of related items, analyzing and predicting affinities
The Data Mining development methodology is composed of different phases and each phase has its own considerations and purpose as stated below.
- What are you looking for? What types of relationships are you trying to find?
- Does the problem you are trying to solve reflect the policies or processes of the business?
- Do you want to make predictions from the data mining model, or are you just looking for interesting patterns and associations?
- Which outcome or attribute do you want to try to predict?
- What kind of data do you have and what kind of information is in each column?
- If there are multiple tables, how are the tables related?
- Do you need to perform any cleansing, aggregation, or processing to make the data usable?
- How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the business?
- Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries.
- For example: The data might show that a customer bought a product before the product was offered on the market, or that the customer shops regularly at a store located 2,000 miles from her home.
- Data cleaning is not just about removing bad data or interpolating missing values, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis.
- Tools for data preparation: SQL Server Integration Services, Master Data Services, Data Quality Services.
- Itís necessary to understand the data in order to make appropriate decisions when you create the mining models. Exploration techniques include calculating the minimum and maximum values, calculating mean and standard deviations, and looking at the distribution of the data.
- For example: You might determine by reviewing the maximum, minimum, and mean values that the data is not representative of your customers or business processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis for your expectations.
- Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help you improve the model. Data that strongly deviates from a standard distribution might be skewed, or might represent an accurate picture of a real-life problem, but make it difficult to fit a model to the data.
Data Mining Model Development
- You define the columns of data that you want to use by creating a mining structure. The mining structure is linked to the source of data, but does not actually contain any data until you process it.
- When you process the mining structure, Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure.
- Before the structure and model is processed, a data mining model too is just a container that specifies the columns used for input, the attribute that you are predicting, and parameters that tell the algorithm how to process the data. Processing a model is often called training.
- Training refers to the process of applying a specific mathematical algorithm to the data in the structure in order to extract patterns. The patterns that you find in the training process depend on the selection of training data, the algorithm you chose, and how you have configured the algorithm.
- SQL Server 2016 contains many different algorithms, each suited to a different type of task, and each creating a different type of model. You can define a new model by using the Data Mining Wizard in SQL Server Data Tools, or by using the Data Mining Extensions (DMX) language.
Data Mining Model Validation
- Analysis Services provides tools that help you separate your data into training and testing datasets so that you can accurately assess the performance of all models on the same data. You use the training dataset to build the model, and the testing dataset to test the accuracy of the model by creating prediction queries.
- You can explore the trends and patterns that the algorithms discover by using the viewers in Data Mining Designer in SQL Server Data Tools.
- You can also test how well the models create predictions by using tools in the designer such as the lift chart and classification matrix.
- To verify whether the model is specific to your data, or may be used to make inferences on the general population, you can use the statistical technique called cross-validation to automatically create subsets of the data and test the model against each subset.
Deploying and updating models
- Use the models to create predictions, which you can then use to make business decisions. SQL Server provides the DMX language that you can use to create prediction queries, and Prediction Query Builder to help you build the queries.
- Create content queries to retrieve statistics, rules, or formulas from the model.
- Use Integration Services to create a package in which a mining model is used to intelligently separate incoming data into multiple tables.
- Create a report that lets users directly query against an existing mining model.
We will be focusing on the data mining model development, validation and deployment phase in the upcoming chapters assuming that we have the required data in place.
- Consider reading this article to install the Adventure Works databases. It is recommended to install the 2014 version of the Adventure Works Data warehouse, though any version will work.