Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Scripts          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Exploratory Data Analysis with Python in SQL Server 2017


By:   |   Last Updated: 2018-07-23   |   Comments   |   Related Tips: More > Python

Problem

Python is one of the most prominent languages used for data science apart from R. Python is a very versatile language and can be used for a variety of purposes ranging from web programming to data science applications. Diverse types of charts and graphs make the analysis faster and easier compared to analyzing data by statistics alone. A large number of reporting tools offer a wide variety of visualizations, but when the analysis is in the preliminary phase, these reporting tools may not be of much use as the data analysis would be just in exploratory phase.

For exploratory data analysis, some of the fundamental visualizations are used like Histogram, Scatterplot, Box and Whisker plot, Density plot, pie charts, bar charts, column charts, etc. For data science applications, the most commonly used visualizations for graphical exploratory analysis are – Histogram, Density Plot, Box Plot and Scatterplot.

In this tip, we will learn how to generate diverse types of basic charts used in data science applications by using a sample dataset and Python in SQL Server 2017.

Solution

Python ships with a built-in library named matplotlib for rending graphs and charts. In this tip we will use a sample dataset and will explore the data using the above-mentioned chart types and we will generate these charts using Python and SQL Server 2017.

It is assumed that SQL Server 2017 is installed along with Python on the development machine. If you are new to using Python with T-SQL in SQL Server, it’s highly recommended that you go through the SQL Server 2017 and Python Basics tutorial before proceeding with this tip.  Follow the steps mentioned below to use Python for generating charts and graphs discussed in this tip.

Step 1 - We need a sample dataset before we can start charts and graphs as these visualizations would need some source data to generate the visual. You can use any data that you may already have. In this tip we are using the sample data from NFL playoff records, and below is how the data would look. Create a table with this sample data and name it something relevant. In this tip, we will be using the table name NFLTest.

Dataset - Description: Dataset

Creating a Histogram Chart

A histogram chart is like a bar chart and is used to understand the spread of data in a range of values which is known as a “bin” in data science terminology. It shows the record counts in any given range, and each range is represented by a bar in the histogram. Histograms generally are the very first visualization used to study the spread of data. The limitation is that histograms can represent only one field in the chart. The below example demonstrates the method of creating a histogram chart on the Won field from NFLTest table.

execute sp_execute_external_script 
@language = N'Python',
@script = N'
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Histogram
fig1_handle = plt.figure(figsize=(10,10))
plt.hist(InputDataSet.Won)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency") 
plt.savefig("c:\\Temp\\histogram.png")


',
@input_data_1 = N'SELECT Won, Team, Division, TotalGames FROM NFLTest'  
			

Here we are using the pyplot package in matplotlib library which has distinct functions to create charts along with details of the resolution. We are importing other packages like pandas and numpy which are used for data processing and numerical calculations. First, we create the handle of the figure where we specify the resolution or size of the plot. Then we are using hist plotting function to create the histogram chart. The dataset being used is passed by the @sqlquery parameter where we are just reading the fields from a table. To use this data inside the Python script, we are using the default dataset name – InputDataSet and referring to the fields inside this data frame. Inside the charting functions, we are providing charting related details like title and label. When you execute this code, the savefig function saves the graphic device to which the plotting functions renders the output and a png file with the histogram chart is saved in the file path specified in the code. The same logic is used in all the charts below, only the functions and parameters change respectively.

Histogram - Description: Histogram

Creating a Kernel Density Plot

To visualize a more clear and precise shape of the spread of data, the Kernel Density plot is used. The Kernel Density Plot can be perceived as a smooth histogram. Many complex formulas are involved to derive the curve and shape of kernel density plot. Gaussian_KDE function in the scipy.stats package encapsulates all the logic required to calculate kernel density. You can read more about kernel density estimation from here.

execute sp_execute_external_script @language = N'Python',
@script = N'
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde

# Kernel Density Plot
fig2_handle = plt.figure(figsize=(8,8))
density = gaussian_kde(InputDataSet.Won)
xs = np.linspace(0,1000,200)
plt.plot(xs,density(xs))
plt.title("Kernel Density Plot")
plt.xlabel("Wins")
plt.ylabel("Density") 
plt.savefig("c:\\Temp\\KernelDensityPlot.png")

',
@input_data_1 = N'SELECT Won, Team, Division, TotalGames FROM NFLTest'  
			
Kernel Density Plot - Description: Kernel Density Plot

Creating a Box-plot Chart

The box-plot chart is used for outlier analysis, so that data points that are far off from the inter-quartile range are identified. This chart is not ideal if the focus in on individual data points. But this is the first chart that would generally get considered for outlier analysis on multi-variate data. In the below example, we are creating four collections using the random function from the numpy library. In real applications, each collection would be a field from a database. Using the four collections, we are creating four box-and-whiskers, one for each collection. Since we have a collection of box-plots, first we create a subplot and then add it as a parameter to the boxplot function to create one consolidated boxplot chart.

execute sp_execute_external_script language = N'Python',
@script = N'
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Box and Whisker Plot
fig3_handle = plt.figure(figsize=(10,10))
np.random.seed(10)
collection_1 = np.random.normal(100, 10, 200)
collection_2 = np.random.normal(80, 30, 200)
collection_3 = np.random.normal(90, 20, 200)
collection_4 = np.random.normal(70, 25, 200)
data_to_plot = [collection_1, collection_2, collection_3, collection_4]
ax = fig3_handle.add_subplot(111)
bp = ax.boxplot(data_to_plot)
fig3_handle.savefig("c:\\Temp\\Boxplot.png")

' 
			
Box plot - Description: Box plot

Creating a Scatterplot Chart

Any exploratory data analysis is incomplete without a scatterplot chart. This is a visualization that explains the measure of association to analyze the correlation between two numerical variables. Below is an example of a scatterplot with Total Games on the Y-axis and Total Wins on X-axis. This plot shows a near linear relationship between the two variables. The scatter function is used to create a scatterplot.

execute sp_execute_external_script language = N'Python',
@script = N'
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Scatterplot
colors = (0,0,0)
area = np.pi*3
fig_handle = plt.figure(figsize=(10,10))
plt.scatter(InputDataSet.Won, InputDataSet.TotalGames, s=area, c=colors, alpha=0.5)
plt.title("Scatter plot example")
plt.xlabel("Won")
plt.ylabel("Total Games")
plt.savefig("c:\\Temp\\scatterplot.png")

',
@input_data_1 = N'SELECT Won, Team, Division, TotalGames FROM NFLTest'  
			
Scatterplot - Description: Scatterplot

In this way, we can use distinct functions from the matplotlib library in Python and use the same with T-SQL in SQL Server to generate and save charts without using any external reporting tools or technologies.

Next Steps
  • Consider reading more about the matplotlib library and try out different formation options and chart related functions in this library for exploratory data analysis.


Last Updated: 2018-07-23


get scripts

next tip button



About the author
MSSQLTips author Siddharth Mehta Siddharth Mehta is an Associate Manager with Accenture in the Avanade Division focusing on Business Intelligence.

View all my tips




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Learn more about SQL Server tools