Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Introduction to the SQL Server Analysis Services Linear Regression Data Mining Algorithm


By:   |   Read Comments   |   Related Tips: > Analysis Services Tutorials

Quickly Resolve Performance Problems for IIS, .NET and SQL Server       >>>   Get Started


Problem

In this tip we walk through how to setup and view data using SQL Server Analysis Services Linear Regression Data Mining Algorithm.

Solution

In data mining and machine learning circles, the linear regression algorithm is one of the easiest to explain. The linear regression algorithm generates a linear equation that best fits a set of data containing an independent and dependent variable. In this tip, we show how to create a simple data mining model using the Linear Regression algorithm in SQL Server Analysis Services 2014. Before we start, we will execute the following T-SQL code to create our sample dataset with 10,000 rows.

SET NOCOUNT ON
GO

CREATE TABLE dbo.tblLinearRegression
(
   pkLinearRegressionExample int identity (1,1) primary key,
   AttributeX integer NOT NULL,
   AttributeY decimal(18, 2) NOT NULL
) ON PRIMARY

GO

declare @i as integer
set @i=1

while @i<=2500
begin
   insert into dbo.tblLinearRegression values (@i, @i * 0.5 + round((300-100)* rand() +100,2))
   insert into dbo.tblLinearRegression values (@i, @i * 0.5 + round((300-100)* rand() +100,2))
   insert into dbo.tblLinearRegression values (@i, @i * 0.5 + round((300-100)* rand() +100,2))
   insert into dbo.tblLinearRegression values (@i, @i * 0.5 + round((300-100)* rand() +100,2))
   set @i=@i+1
end
GO

The values for the AttributeY column will be a random distribution of data around the line y = 0.5x + 200 as shown in the figure below.

Introduction to the SQL Server Analysis Services Linear Regression Data Mining Algorithm

In Visual Studio, create a new Analysis Services Multidimensional and Data Mining Project.

In Visual Studio, create a new Analysis Services Multidimensional and Data Mining Project.

In this tip, we will name the project LinearRegressionExample. Click on OK when finished with the New Project window.

In this tip, we will name the project LinearRegressionExample.

In the Solution Explorer window, right-click on the Data Sources folder and choose "New Data Source..." to initiate the Data Source Wizard.

In the Solution Explorer window, right-click on the Data Sources folder and choose "New Data Source..."

Click on "Next >".

Initiate the Data Source Wizard.

Choose your data connection, if one exists. If a data connection does not exist, click on "New..." to create a new data connection.

Choose your data connection

In this example, we are using a connection to the Tips database on the localhost.

We are using a connection to the Tips database on the localhost.

Click on "Next >".

Click on "Next >".

On the Impersonation Information screen, click on "Use a specific Windows user name and password." Enter your username and password. Click on "Next >".

On the Impersonation Information screen, click on "Use a specific Windows user name and password.

On the Completing the Wizard screen, the data source name can be changed if desired. Click on "Finish".

On the Completing the Wizard screen, the data source name can be changed if desired. Click on "Finish".

The new data source will appear in the Solution Explorer.

The new data source will appear in the Solution Explorer.

In the Solution Explorer window, right-click on the Data Source Views folder and choose "New Data Source View..." to launch the Data Source View Wizard.

In the Solution Explorer window, right-click on the Data Source Views folder and choose "New Data Source View..."

Click on "Next >".

Launch the Data Source View Wizard.

On the Select a Data Source page in the Relational data sources list box, select the data source we created in the above step. Click on "Next >".

On the Select a Data Source page in the Relational data sources list box

On the Select Tables and Views page, move the table tblLinearRegression from the Available Objects list box to the Included Objects list box by selecting tblLinearRegression in the Available objects list box and then clicking on the ">" box. Click on "Next >".

On the Select Tables and Views page, move the table tblLinearRegression from the Available Objects list box

On the Completing the Wizard page, give the Data Source View a name and click on "Finish".

Give the Data Source View a name and click on "Finish".

The data source view now appears in the Solution Explorer window. In the center of the image below, we see the columns listed in the table tblLinearRegression.

The data source view now appears in the Solution Explorer window.

In the Solution Explorer, right-click on the Mining Structures folder and select "New Mining Structure..." to launch the Data Mining Wizard.

In the Solution Explorer, right-click on the Mining Structures folder

Click on "Next >".

Launch the Data Mining Wizard.

Press the "From existing relational database or data warehouse" radio button and then click "Next >".

Press the "From existing relational database or data warehouse" radio button and then click "Next >".

Select the Microsoft Linear Regression as the data mining technique. Please note in the description that the Linear Regression algorithm is a special application of the Decision Trees algorithm and will not split the data as a decision tree would normally do. There will be an informational message displayed later that will refer to there not being a split in the data.

Select the Microsoft Linear Regression as the data mining technique.

On the Select Data Source View page, we will use our previously defined data source object. Click on "Next >".

On the Select Data Source View page, we will use our previously defined data source object. Click on "Next >".

Next, check the Case box on the ColumnsForDataMining line. Click on "Next >".

Check the Case box on the ColumnsForDataMining line. Click on "Next >".

On the Specify the Training Data page, check the box in the Key column that corresponds with the pkLinearRegressionExample column. The AttributeX column will be used as input because it is the independent variable. The AttributeY column is the dependent variable, so we check the Predictable box for the AttributeY column. Click on "Next >".

On the Specify the Training Data page, check the box in the Key column.

The default values for the Content Type and Data Type are shown below on the Specify Columns' Content and Data Type page. Click on "Next >".

The default values for the Content Type and Data Type are shown below on the Specify Columns' Content and Data Type page

We will use 30% of our data for testing the mining model's accuracy. Click on "Next >".

We will use 30% of our data for testing the mining model's accuracy. Click on "Next >".

On the Completing the Wizard screen, we can rename the mining structure name and the mining model name. Click on "Finish".

Rename the mining structure name and the mining model name

Our mining structure now appears in the Solution Explorer.

Our mining structure now appears in the Solution Explorer.

The Mining Structure tab is selected by default. At this point the Analysis Services objects reside in the Visual Studio project. Click on the Mining Model Viewer tab.

The Mining Structure tab is selected by default.

Visual Studio will attempt to deploy the SSAS objects to the server specified in the project properties. When asked if "Would you like to build and deploy the project first?", choose "Yes".

When asked if "Would you like to build and deploy the project first?", choose "Yes".

When given the warning about the time it could take to process the mining model and asked "Do you wish to continue?", choose "Yes". The number of records in the view is not a large amount, so it should not take more than a minute to process.

The number of records in the view is not a large amount, so it should not take more than a minute to process.

When the Process Mining Model window appears, press the "Run..." button.

When the Process Mining Model window appears, press the "Run..." button.

The Process Progress window will appear. When the process completes successfully, the informational message "Decision Trees found no splits for model" will display which is OK. Select "Close" in the Process Progress window and "Close" again in the Process Mining Model window.

The Process Progress window will appear

Depending on your hardware configuration, the Load Mining Model Content window might appear stating to "Please wait...". The Deployment Progress window will appear also stating that the SSAS objects were successfully deployed to the Analysis Services server.

The Deployment Progress window will appear also stating that the SSAS objects were successfully deployed to the Analysis Services server.

Clicking on the Mining Legend tab (as shown on the bottom right of the following image) will give us the equation for the line that best fits the data.

Clicking on the Mining Legend tab will give us the equation for the line that best fits the data.

The equation derived from the model is y = 0.501x + 197.609, which is approximately the y = 0.5x + 200 equation we set out to attain in this example. The y-intercept comes from the Coefficient column where the Term is blank.

The y-intercept comes from the Coefficient column where the Term is blank.

The slope comes from the Coefficient column where the Term is Attribute X. Notice there are 7,000 Existing Cases. This is because we specified above that 30% of our records (3,000 out of 10,000) were used for testing the model while the remaining 70% were used for training the model.

 
The slope comes from the Coefficient column where the Term is Attribute X

In the figure below, the linear equation derived from the model is plotted on top of our original data, which shows that the line is an excellent best fit for our data.

The linear equation derived from the model is plotted on top of our original data, which shows that the line is an excellent best fit for our data.
Summary

In this tip, we have provided an introduction to the Linear Regression data mining algorithm in SQL Server 2014 Analysis Services.

Next Steps

Explore changes to the results by changing the source data used by the mining model. Also, check out these other tips on data mining in SQL Server Analysis Services.



Last Update:


signup button

next tip button



About the author
MSSQLTips author Dallas Snider Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.

View all my tips





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Learn more about SQL Server tools