Data Science for SQL Server Professionals



Problem

What is Data Science and what is the life cycle of a Data Science project?

Solution

Data Science

Data science is the use of scientific methods and algorithms to extract, transform and analyze data in order to gain insights from it. It can also be defined as a combination of statistics, data analysis and machine learning.

What is a Data Scientist?

A data scientist is a professional who practices data science. A data scientist is an expert in the relevant domain and also an expert analyst. In addition, the data scientist is an expert developer who can develop and maintain algorithms.

Mandatory Skills for a Data Scientist

  • Programming knowledge (Python, R and SQL)
  • Data transformation (data extraction, cleansing and transformation)
  • Statistical modeling
  • Data analysis and visualization
  • Machine learning

Jargon Busters

Statistics

Statistics is a mathematical approach to collecting, analyzing and presenting large quantities of data. You will learn and apply linear regression, the Gaussian distribution, correlation, sample size, probability, differential equations and stochastic processes.
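
As a minimal sketch of two of the ideas named above, the following Python computes summary statistics and a Pearson correlation using only the standard library. The sample values and the function name pearson_correlation are made up for illustration.

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("samples must have the same length (at least 2)")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each sample.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical sample: hours of exercise per week and resting heart rate.
hours = [1, 2, 3, 4, 5]
heart_rate = [80, 78, 74, 72, 70]

print("mean hours:", statistics.mean(hours))
print("sample standard deviation:", statistics.stdev(hours))
print("correlation:", pearson_correlation(hours, heart_rate))
```

A coefficient close to -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little or none.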

Machine Learning

Machine learning is a field that enables computers to learn from large quantities of data without being explicitly programmed. Once the computer has learned the patterns in the data, it can predict future values or classify new data.

In conventional programming, we give the computer system instructions to act upon based on several conditions (business rules). In machine learning, however, the computer system itself learns from the actual data and does not need explicit programming.

For example, let’s say we have a dataset containing the details of 10,000 patients, each labeled as diabetic or not. We can pass this dataset to a computer system so that it learns and builds its own rules to decide whether a given patient is diabetic.
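
The contrast can be sketched in a few lines of Python. The first function is a hand-written business rule; the second searches labeled data for the cutoff that classifies the most training records correctly. The glucose readings, the labels, the cutoff of 140 and the function names are all invented for illustration, and a real model would use many more attributes.

```python
def rule_based_is_diabetic(glucose):
    """Conventional programming: a person writes the rule (hypothetical cutoff)."""
    return glucose >= 140


def learn_threshold(values, labels):
    """Learn a cutoff from labeled data by trying each midpoint between
    neighboring values and keeping the one with the most correct answers."""
    candidates = sorted(set(values))
    best_threshold, best_correct = None, -1
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        correct = sum((v >= threshold) == label for v, label in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Made-up training data: glucose reading and whether the patient is diabetic.
glucose = [85, 92, 101, 110, 118, 131, 145, 160, 172, 190]
diabetic = [False, False, False, False, False, True, True, True, True, True]

threshold = learn_threshold(glucose, diabetic)
print("learned threshold:", threshold)
print("prediction for a reading of 150:", 150 >= threshold)
```

Nobody typed the learned cutoff into the program; it comes from the data, so retraining on different records can produce a different rule.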

Machine Learning Examples

  • Identifying cancer in its early stages based on several patient attributes and scanned images
  • Analyzing hand-written notes
  • Image recognition
  • Face and voice recognition for authentication

Machine learning uses algorithms such as these to build its own rules:

  • Linear Regression
  • Logistic Regression
  • Decision Tree
  • K-means Clustering
  • Anomaly Detection

Machine learning uses statistics to build and evaluate the algorithms mentioned above. A programming language such as Python or R is used to make the computer system learn from a sample dataset and build a model. Once the model has been built, it can be used for prediction.
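
The build-then-predict workflow can be illustrated with the simplest algorithm in the list, linear regression. This is a sketch in plain Python using ordinary least squares; the spend and sales figures and the names fit_line and predict are hypothetical. In practice a library such as scikit-learn would usually do this work.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up sample dataset: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

model = fit_line(spend, sales)  # the "learning" step
print("slope, intercept:", model)
print("predicted sales for spend 6.0:", predict(model, 6.0))
```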

Data Science Life Cycle

Every data science project has six phases, and the process is iterative:

  • Problem Definition
  • Data Collection
  • Data Preparation
  • Model Creation
  • Model Evaluation
  • Model Deployment

[Figure: Data Science Process]

Problem Definition

In this phase, we try to understand the problem or question that the organization wants to solve. This can be something we want to predict, a decision we want to make or a hypothesis we want to test. You will ask many questions to understand the problem, such as:

  • Who are the customers?
  • What is the current process and how does it work?
  • What information are we collecting now?

In this phase, you will deepen your functional knowledge of the business. After analyzing the problem, you will be able to come up with a definition of the problem or a question you want to solve using data science.

Data Collection

Now that we understand the problem, our aim is to collect the data from the relevant sources. If the data is already available, you can source it. If it is not available, you may need to create a new dataset.

In this phase, you will leverage your technical skills to source the data. The data may be available as files, in a relational database or in an unstructured format. You will choose the best available tool to extract data from one or more sources.
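
As a rough sketch of collecting data from two kinds of sources, the Python below reads one attribute from a CSV file and another from a relational table, then combines them on a shared key. Everything here is an assumption made for illustration: the column names, the lab_results table and the values are invented, an in-memory string stands in for the file, and SQLite stands in for SQL Server so that the example runs without a server.

```python
import csv
import io
import sqlite3

# Source 1: a file. A real project would open a path on disk; an in-memory
# string stands in for the file here. The columns are hypothetical.
csv_text = "patient_id,age\n1,54\n2,61\n3,47\n"
ages = {int(row["patient_id"]): int(row["age"])
        for row in csv.DictReader(io.StringIO(csv_text))}

# Source 2: a relational database. SQLite stands in for SQL Server so the
# sketch runs anywhere; the table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_results (patient_id INTEGER, glucose REAL)")
conn.executemany("INSERT INTO lab_results VALUES (?, ?)",
                 [(1, 145.0), (2, 98.5), (4, 120.0)])
rows = conn.execute(
    "SELECT patient_id, glucose FROM lab_results ORDER BY patient_id"
).fetchall()
conn.close()

# Combine the two sources on the shared key, keeping patients found in both.
dataset = [{"patient_id": pid, "age": ages[pid], "glucose": glucose}
           for pid, glucose in rows if pid in ages]
print(dataset)
```

Patients 3 and 4 appear in only one source, so they are left out of the combined dataset. Whether to keep or drop such partial records is a decision for the next phase.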

Data Preparation

As the source data may contain many errors and missing attributes, bad data must be enriched or removed. This phase is usually referred to as “Data Wrangling” or “Data Munging”. This step often takes 60-70% of the overall time taken to complete the project.
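
A minimal sketch of this kind of cleanup in Python is shown below. The records, the plausible age range of 0 to 120 and the choice to fill missing glucose values with the median are all assumptions made for the example; the right rules depend on the problem being solved.

```python
import statistics

# Made-up raw records: None marks a missing value, and one age is impossible.
raw = [
    {"patient_id": 1, "age": 54, "glucose": 145.0},
    {"patient_id": 2, "age": None, "glucose": 98.5},
    {"patient_id": 3, "age": 47, "glucose": None},
    {"patient_id": 4, "age": 39, "glucose": 120.0},
    {"patient_id": 5, "age": 430, "glucose": 110.0},
    {"patient_id": 1, "age": 54, "glucose": 145.0},  # duplicate
]


def prepare(records):
    """Remove duplicates and bad rows, then fill in missing glucose values."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["patient_id"] in seen:
            continue  # drop repeated patients
        seen.add(rec["patient_id"])
        if rec["age"] is None or not 0 <= rec["age"] <= 120:
            continue  # remove rows whose age is missing or implausible
        cleaned.append(dict(rec))  # copy so the raw data stays untouched
    # Enrich: replace missing glucose with the median of the known values.
    known = [r["glucose"] for r in cleaned if r["glucose"] is not None]
    median = statistics.median(known)
    for rec in cleaned:
        if rec["glucose"] is None:
            rec["glucose"] = median
    return cleaned


clean = prepare(raw)
print(clean)
```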

Model Creation

In this phase, we create a model to predict the outcome, to support the hypothesis or to make a decision. The model can be a numerical model, a statistical model or a machine learning model.

Model Evaluation

In this phase, we evaluate the model to assess its suitability for the defined problem or question. The model can also be validated against the given data and context.
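
One common way to evaluate a model, sketched here under stated assumptions, is to hold back part of the data, build the model on the rest and measure accuracy on the held-back rows. The data, the 30% test fraction, the seed and the simple cutoff model are all invented for the example, and accuracy is only one of several possible measures.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the actual label."""
    hits = sum(predict(value) == label for value, label in rows)
    return hits / len(rows)


# Made-up labeled data: (glucose reading, is_diabetic).
data = [(85, False), (92, False), (101, False), (110, False), (118, False),
        (131, True), (145, True), (160, True), (172, True), (190, True)]

train, test = train_test_split(data)

# Stand-in "model": a cutoff halfway between the class averages in the
# training set. Any model exposing a predict function could be scored this way.
pos = [v for v, label in train if label]
neg = [v for v, label in train if not label]
cutoff = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def model(value):
    return value >= cutoff


print("accuracy on unseen test rows:", accuracy(model, test))
```

Scoring on rows the model never saw gives a more honest estimate of how it will behave on new data than scoring on the training rows.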

Model Deployment

Once the model has been validated, it can be deployed. This means that the findings can be shared with others, and the results can be used to make decisions or take action.

The above steps will be repeated for each question we would like to answer. The whole process is iterative and all the steps will often be revisited several times. Sometimes the chosen model may not bring the ideal results and the whole process can be terminated at a given stage.

Data Science Application

Data science has a wide range of applications. Some examples are listed below.

  • Search engines are using data science algorithms to deliver search results in less than a second
  • Amazon uses data science to prepare a list of product suggestions based on previous search results and user experience
  • Image recognition algorithms are used by Facebook, WhatsApp and Google to detect human faces and enable search using images
  • Speech recognition algorithms are widely used in products such as Google Voice, Siri and Cortana
  • Fraud and risk detection systems have been improved with the help of data science
  • Development of self-driving cars

Summary

Data science has broad scope across many areas such as finance, bioinformatics, supply chain optimization, and health and well-being. In this tip, we have learned the basics of data science. In future tips, we will learn more practical ways to apply data science algorithms to common problems.

Next Steps
  • Stay tuned to read the next tip on Data Science on MSSQLTips.com.
  • Read and understand the differences between Data Science and other BI jargon here.
  • Learn more about the data science process here.


About the author
Nat Sundar is working as an independent SQL BI consultant in the UK with a Bachelor's Degree in Engineering.
