Data Science for SQL Server Professionals

Problem

What is Data Science and what is the life cycle of a Data Science project?

Solution

Data Science

Data science is the practice of extracting, transforming and analyzing data with algorithms in order to draw insights from it. It can also be described as a combination of statistics, data analysis and machine learning.

What is a Data Scientist?

A data scientist is a professional who practices data science. A data scientist combines expertise in the relevant domain with strong analytical skills, and is also a capable developer who can build and maintain algorithms.

Mandatory Skills for a Data Scientist

Programming knowledge (Python, R and SQL)
Data transformation (data extraction, cleansing and transformation)
Statistical modeling
Data analysis and visualization
Machine learning

Jargon Busters

Statistics

Statistics is a mathematical approach to collecting, analyzing and presenting large quantities of data. You will learn and apply linear regression, the Gaussian distribution, correlation, sample size, probability, differential equations and stochastic processes.
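As a small illustration of two of these ideas, the sketch below computes the mean, the sample standard deviation and the Pearson correlation for two made-up columns, using only the Python standard library. The values and the names (hours_studied, exam_score, pearson_correlation) are assumptions for this example and do not come from a real dataset.

```python
import math
import statistics

# Hypothetical sample data: hours studied and exam score for six students.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 73]


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of products of deviations from the means (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


mean_score = statistics.mean(exam_score)
stdev_score = statistics.stdev(exam_score)  # sample standard deviation
r = pearson_correlation(hours_studied, exam_score)

print(f"mean score = {mean_score:.2f}")
print(f"standard deviation = {stdev_score:.2f}")
print(f"correlation = {r:.3f}")
```

A correlation close to 1 means the two columns rise together almost linearly, which is the kind of relationship linear regression tries to capture.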

Machine Learning

Machine learning is a science which helps computers learn by themselves with a large quantity of data, without explicit programming. Once the computer has learned the pattern, it will predict future values or classify future data.

In conventional programming, we give the computer instructions that tell it how to act under various conditions (business rules). In machine learning, however, the computer system itself learns from the actual data and does not need explicit programming.

For example, let’s say we have the details of 10,000 patients, each recorded as diabetic or not. We can pass this dataset to a computer system so that it learns and builds its own rules for deciding whether a given patient is diabetic.
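To make the idea concrete, here is a minimal sketch in which the rule is learned from labeled records instead of being written by hand. It learns a single cut-off on one attribute (a one-feature decision stump). The ten records, the choice of fasting glucose as the attribute and the function names are all assumptions for illustration. A real model would use many attributes and far more data, and the numbers here are not clinical guidance.

```python
# Hypothetical labeled records: (fasting glucose in mg/dL, is_diabetic).
# These values are made up for illustration only.
patients = [
    (85, False), (92, False), (99, False), (105, False), (110, False),
    (118, True), (126, True), (135, True), (150, True), (180, True),
]


def learn_threshold(records):
    """Find the glucose cut-off that classifies the training records best.

    The "rule" is learned from the data instead of being hard-coded.
    """
    values = sorted(value for value, _ in records)
    # Candidate cut-offs are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_correct = None, -1
    for threshold in candidates:
        # Count records where the rule "value >= threshold" matches the label.
        correct = sum((value >= threshold) == label for value, label in records)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


threshold = learn_threshold(patients)


def predict(glucose):
    """Apply the learned rule to a new patient."""
    return glucose >= threshold


print(f"learned threshold: {threshold}")
print(f"glucose 140 -> diabetic? {predict(140)}")
print(f"glucose 95 -> diabetic? {predict(95)}")
```

Nobody typed the cut-off into the program. If the training records change, the learned rule changes with them, which is the key difference from a hand-written business rule.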

Machine Learning Examples

Identifying cancer in its early stages based on patient attributes and scanned images
Analyzing hand-written notes
Image recognition
Face and voice recognition for authentication

Machine learning uses algorithms such as the following to build its own rules:

Linear Regression
Logistic Regression
Decision Tree
K-means Clustering
Anomaly Detection

Machine learning relies on statistics to implement the algorithms listed above. A programming language such as Python or R is used to make the computer system learn from a sample dataset and build a model. Once the model has been built, it can be used for prediction.
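The build-then-predict cycle can be sketched with the simplest algorithm in the list, linear regression, fitted here by ordinary least squares in plain Python. The dataset (floor area versus sale price) and the function names are invented for this example. The numbers were chosen to lie exactly on a line so the result is easy to check by hand, which real data will not do.

```python
# Hypothetical sample dataset: floor area (square metres) and sale price
# (thousands). The numbers are invented so that the fit is easy to follow.
areas = [50, 60, 70, 80, 90]
prices = [150, 180, 210, 240, 270]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": the model is just the two learned numbers.
slope, intercept = fit_line(areas, prices)


def predict_price(area):
    """Use the fitted model to predict the price for a new floor area."""
    return slope * area + intercept


print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted price for 100 square metres: {predict_price(100):.1f}")
```

In practice you would more likely use a library such as scikit-learn in Python or the lm function in R, but the two steps stay the same: fit the model on sample data, then call it on new inputs.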

Data Science Life Cycle

Every data science project has six phases, and the process is iterative:

Problem Definition
Data Collection
Data Preparation
Model Creation
Model Evaluation
Model Deployment

[Figure: Data Science Process]

Problem Definition

In this phase, we try to understand the problem or question that the organization wants to solve. This can be something we want to predict, a decision we want to make or a hypothesis we want to test. You will ask many questions to understand the problem, such as:

Who are the customers?
What is the current process and how does it work?
What information are we collecting now?

In this phase, you will build a deeper functional knowledge of the business. After analyzing the problem, you will be able to come up with a definition of the problem or a question you want to solve using data science.

Data Collection

Now that we understand the problem, our aim is to collect the data, often from many sources. If the data is already available, you can source it. If it is not available, you may need to create a new dataset.

In this phase, you will leverage your technical skills to source the data. The data may be available as files, in a relational database or in an unstructured format. You will choose the best available tool to extract the data from one or more sources.
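As a rough sketch of sourcing data from two places, the example below reads one dataset from CSV text and another from a relational table, then joins them on a shared key. To keep it runnable anywhere, an in-memory string stands in for the CSV file and an in-memory SQLite database stands in for SQL Server. The table, column names and values are all hypothetical.

```python
import csv
import io
import sqlite3

# Source 1: a CSV file. An in-memory string stands in for a real file here.
csv_text = "customer_id,city\n1,London\n2,Leeds\n3,Bristol\n"
customers = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: a relational database. SQLite (in memory) stands in for
# SQL Server so the sketch runs without a server; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 45.5)],
)
totals = dict(
    conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()
)
conn.close()

# Combine both sources into one dataset keyed on customer_id.
# Customers with no orders get a total spend of 0.0.
dataset = [
    {
        "customer_id": int(row["customer_id"]),
        "city": row["city"],
        "total_spend": totals.get(int(row["customer_id"]), 0.0),
    }
    for row in customers
]

for record in dataset:
    print(record)
```

Against a real SQL Server instance the query would look much the same, with only the connection step changing. Doing the aggregation in SQL keeps the amount of data pulled into Python small.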

Data Preparation

As the source data may contain many errors and missing attributes, the data must be enriched where possible and bad data removed. This phase is usually referred to as “Data Wrangling” or “Data Munging”. This step often takes 60-70% of the overall time taken to complete the project.
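A minimal sketch of typical wrangling steps follows: dropping duplicates, removing rows with impossible values and filling in missing values. The records, the field names and the choice of the median as the fill value are assumptions for this example. The right cleaning rules always depend on the dataset and the problem.

```python
import statistics

# Hypothetical raw records with typical problems: missing values (None),
# an impossible value (negative age) and an exact duplicate.
raw_records = [
    {"id": 1, "age": 34, "income": 42000},
    {"id": 2, "age": None, "income": 51000},
    {"id": 3, "age": -5, "income": 38000},
    {"id": 4, "age": 45, "income": None},
    {"id": 1, "age": 34, "income": 42000},
]


def prepare(records):
    """Remove duplicates and invalid rows, then fill missing values."""
    seen_ids = set()
    cleaned = []
    for record in records:
        if record["id"] in seen_ids:
            continue  # drop repeated ids, keeping the first occurrence
        seen_ids.add(record["id"])
        if record["age"] is not None and record["age"] < 0:
            continue  # drop rows with an impossible age
        cleaned.append(dict(record))  # copy so the raw data stays untouched

    # Fill each missing value with the median of the known values.
    for field in ("age", "income"):
        known = [r[field] for r in cleaned if r[field] is not None]
        median = statistics.median(known)
        for r in cleaned:
            if r[field] is None:
                r[field] = median
    return cleaned


clean_records = prepare(raw_records)
for record in clean_records:
    print(record)
```

Copying each record before changing it keeps the raw data intact, so the cleaning rules can be adjusted and re-run when a later phase reveals a problem.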

Model Creation

In this phase, we create a model to predict the outcome or to support the hypothesis or to make a decision. The model can be a numerical model or a statistical model or a machine learning model.

Model Evaluation

In this phase, we evaluate the model to assess its suitability for the defined problem or question. The model can also be validated against the given data and context.
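One common way to assess suitability is to score the model on records that were not used to build it, and to summarize the results as a confusion matrix with accuracy, precision and recall. The sketch below does this for a simple threshold classifier. The test records, the cut-off of 114 and the function names are assumptions for illustration only.

```python
# Hypothetical held-out test set: (fasting glucose, actual label).
# These rows were NOT used to build the model.
test_set = [
    (90, False), (101, False), (112, False), (121, False),
    (116, True), (128, True), (140, True), (165, True),
]

# The model under evaluation: a simple threshold rule. The cut-off of 114
# is an assumed value standing in for whatever the model creation phase
# produced.
THRESHOLD = 114


def model(glucose):
    """Classify a patient as diabetic when glucose reaches the cut-off."""
    return glucose >= THRESHOLD


def evaluate(predict, rows):
    """Return confusion-matrix counts plus accuracy, precision and recall."""
    tp = fp = tn = fn = 0
    for value, actual in rows:
        predicted = predict(value)
        if predicted and actual:
            tp += 1  # true positive
        elif predicted and not actual:
            fp += 1  # false positive
        elif not predicted and actual:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(rows),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


metrics = evaluate(model, test_set)
print(metrics)
```

Which metric matters most depends on the problem defined in the first phase. For a medical screening question, for example, a missed case (false negative) is usually more costly than a false alarm, so recall would carry more weight than accuracy.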

Model Deployment

Once the model has been validated, it can be deployed. This means that the findings can be shared with others and the results can be used to make a decision or to take action.

The above steps will be repeated for each question we would like to answer. The whole process is iterative and all the steps will often be revisited several times. Sometimes the chosen model may not bring the ideal results and the whole process can be terminated at a given stage.

Data Science Application

Data science is applied across a wide range of areas. Some examples are listed below.

Search engines are using data science algorithms to deliver search results in less than a second
Amazon uses data science to prepare a list of product suggestions based on previous search results and user experience
Image recognition algorithms are used by Facebook, WhatsApp and Google to detect human faces and enable search using images
Speech recognition algorithms are widely used in products such as Google Voice, Siri and Cortana
Improved fraud and risk detection systems with the help of data science
Development of self-driving cars

Summary

Data science has enormous scope across many areas such as finance, bioinformatics, supply chain optimization, and health and well-being. In this tip, we have learned about the basics of data science. In future tips, we will look at more practical ways to apply data science algorithms to common problems.

Next Steps

Stay tuned to read the next tip on Data Science on MSSQLTips.com.
Read and understand the differences between Data Science and other BI jargon here.
Learn more about the data science process here.

Nat Sundar

Nat Sundar is an independent SQL BI consultant in the UK with a Bachelor’s Degree in Engineering. He has many years of experience in the UK Financial Services Industry (mainly in Insurance and Investment Banking).

He is passionate about SQL Server, SSIS, SSAS, SSRS and MDX. He has presented at SQLBits, SQLPass and the SQL London User Group. He has a special interest in Continuous Integration, Continuous Delivery and Deployment automation for the SQL BI (SSIS, SSAS and SSRS) stack.

He can be contacted at sqlnat@gmail.com, via LinkedIn and on Twitter at @SQLNat.