What is Data Science and what is the life cycle of a Data Science project?
Data science is the practice of extracting, transforming and analyzing data with algorithms in order to draw insights from it. It can also be defined as a combination of statistics, data analysis and machine learning.
What is a Data Scientist?
A data scientist is a professional in data science. A data scientist is an expert in the relevant domain and also an expert analyst. In addition, a data scientist is a capable developer who can develop and maintain algorithms.
Mandatory Skills for a Data Scientist
- Programming knowledge (Python, R and SQL)
- Data transformation (data extraction, cleansing and transformation)
- Statistical modeling
- Data analysis and visualization
- Machine learning
Statistics is a mathematical approach to collecting, analyzing and presenting large quantities of data. You will learn and apply topics such as linear regression, the Gaussian distribution, correlation, sample size, probability, differential equations and stochastic processes.
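To make two of these topics concrete, here is a minimal sketch of correlation and linear regression using only the Python standard library. The sample values and the function names `pearson_correlation` and `fit_line` are made up for illustration; in practice you would more likely use a library such as NumPy or pandas.

```python
from math import sqrt
from statistics import fmean

# Hypothetical sample: hours of exercise per week vs. resting heart rate.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
heart_rate = [80.0, 78.0, 75.0, 73.0, 70.0]


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


r = pearson_correlation(hours, heart_rate)
slope, intercept = fit_line(hours, heart_rate)
print(f"correlation={r:.3f}, slope={slope:.2f}, intercept={intercept:.2f}")
```

A correlation close to -1 or +1 indicates a strong linear relationship, and the fitted slope and intercept can then be used to estimate `y` for a new `x`.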
Machine learning is a science which helps computers learn by themselves with a large quantity of data, without explicit programming. Once the computer has learned the pattern, it will predict future values or classify future data.
In conventional programming, we give a computer system instructions to act on, based on several conditions (business rules). In machine learning, however, the computer system itself learns from the actual data and doesn’t need explicit programming.
For example, let’s say we have a set of 10,000 diabetes patients’ details. We can pass this dataset to a computer system to learn and build its own rules to decide whether a given patient is diabetic or not.
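The idea of a system building its own rule from data can be sketched in a few lines of Python. The records below are synthetic and tiny, the single attribute (fasting glucose) is an assumption chosen for illustration, and the learned cut-off is not a clinical criterion. The function names `learn_threshold` and `predict` are hypothetical.

```python
# Synthetic (made-up) records: (fasting glucose in mg/dL, is_diabetic).
records = [
    (85, False), (92, False), (99, False), (105, False), (110, False),
    (128, True), (135, True), (142, True), (150, True), (160, True),
]


def learn_threshold(data):
    """Return the glucose cut-off that classifies the training data best."""
    values = sorted(value for value, _ in data)
    # Candidate cut-offs are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def accuracy(cutoff):
        correct = sum((value >= cutoff) == label for value, label in data)
        return correct / len(data)

    return max(candidates, key=accuracy)


# The "rule" is learned from the data instead of being hard-coded.
threshold = learn_threshold(records)


def predict(glucose):
    """Classify a new patient using the learned cut-off."""
    return glucose >= threshold


print(threshold, predict(100), predict(140))
```

Nobody wrote the cut-off value into the program; it was derived from the labeled examples, which is the difference from a hand-written business rule.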
Machine Learning Examples
- Identifying cancer in the preliminary stages based on several patient attributes and scanned images
- Analyzing hand-written notes
- Image recognition
- Face and voice recognition for authentication
Machine learning uses algorithms such as these to build its own rules:
- Linear Regression
- Logistic Regression
- Decision Tree
- K-means Clustering
- Anomaly Detection
Machine learning uses statistics to build and calculate the above-mentioned algorithms. It uses a programming language such as Python or R to make the computer system learn and build a model from a sample dataset. Once the model has been built, it can be used for prediction.
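As a rough illustration of the build-then-predict pattern, here is a minimal anomaly detection sketch in Python. It uses a simple z-score rule, which is only one of many possible approaches; the sample data, the `z_limit` default and the function names `build_model` and `is_anomaly` are assumptions made for this example.

```python
from statistics import mean, stdev


def build_model(sample):
    """'Training' step: summarise the sample by its mean and standard deviation."""
    return {"mean": mean(sample), "stdev": stdev(sample)}


def is_anomaly(model, value, z_limit=3.0):
    """'Prediction' step: flag values more than z_limit standard deviations away."""
    z_score = abs(value - model["mean"]) / model["stdev"]
    return z_score > z_limit


# Hypothetical sample dataset: number of orders per day.
daily_orders = [100, 102, 98, 101, 99, 103, 97, 100]
model = build_model(daily_orders)

print(is_anomaly(model, 101))  # a typical day
print(is_anomaly(model, 120))  # far outside the usual range
```

The model is built once from the sample and then reused for every new value, which mirrors the flow described above.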
Data Science Life Cycle
Every data science project has six phases, and the process is iterative:
- Problem Definition
- Data Collection
- Data Preparation
- Model Creation
- Model Evaluation
- Model Deployment
In the problem definition phase, we try to understand the problem or question that the organization wants to solve. This can be something we want to predict, a decision we want to make or a hypothesis we want to test. You will ask many questions to understand the problem, such as:
- Who are the customers?
- What is the current process and how does it work?
- What information are we collecting now?
In this phase, you will build deeper functional knowledge of the business. After analyzing the problem, you will be able to come up with a definition of the problem or a question you want to solve using data science.
Once we understand the problem, our aim in the data collection phase is to collect the data from many sources. If the data is already available, you may source it. If the data is not available, you may create a new dataset.
In this phase, you will leverage your technical skills to source the data. The data may be available in files, in a relational database or in an unstructured format. You will decide on the best available tool to extract data from one or more sources.
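As a small sketch of sourcing data from a file and from a relational database, the Python example below reads a CSV source and a SQL table and combines them. To keep it self-contained, an in-memory string stands in for the file and an in-memory SQLite database stands in for the real database; the table name `lab_results` and all column names and values are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk (hypothetical columns and values).
csv_text = "patient_id,age\n1,54\n2,61\n"
file_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Stand-in for a relational database (in-memory SQLite, hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_results (patient_id INTEGER, glucose REAL)")
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?)",
    [(1, 130.0), (2, 98.5)],
)
db_rows = conn.execute("SELECT patient_id, glucose FROM lab_results").fetchall()
conn.close()

# Combine both sources on patient_id into one dataset.
glucose_by_id = dict(db_rows)
combined = [
    {
        "patient_id": int(row["patient_id"]),
        "age": int(row["age"]),
        "glucose": glucose_by_id.get(int(row["patient_id"])),
    }
    for row in file_rows
]

print(combined)
```

With real sources you would replace the in-memory stand-ins with an opened file and a connection to your actual database.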
In the data preparation phase, we enrich or remove bad data, because the source data may contain many errors and missing attributes. This phase is usually referred to as “Data Wrangling” or “Data Munging”. This step often takes 60-70% of the overall time taken to complete the project.
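A minimal Python sketch of this kind of cleanup is shown below. The records and the two rules used here (drop rows with an unusable glucose reading, fill missing ages with the median) are assumptions for illustration only; the right rules always depend on the dataset and the problem.

```python
from statistics import median

# Hypothetical raw records with typical problems.
raw = [
    {"age": 54, "glucose": 130.0},
    {"age": None, "glucose": 98.5},  # missing attribute
    {"age": 61, "glucose": -1.0},    # impossible value, treated as bad data
    {"age": 47, "glucose": 110.0},
]


def clean(records):
    """Remove unusable records and fill in missing ages."""
    # Remove records whose glucose reading is missing or not positive.
    valid = [
        r for r in records
        if r["glucose"] is not None and r["glucose"] > 0
    ]
    # Fill missing ages with the median of the known ages.
    known_ages = [r["age"] for r in valid if r["age"] is not None]
    fill_age = median(known_ages)
    return [
        {**r, "age": r["age"] if r["age"] is not None else fill_age}
        for r in valid
    ]


cleaned = clean(raw)
print(cleaned)
```

Removing a record loses information while filling a value introduces an estimate, so both choices should be recorded and revisited if the model results look suspicious.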
In the model creation phase, we create a model to predict the outcome, to support the hypothesis or to make a decision. The model can be a numerical model, a statistical model or a machine learning model.
In the model evaluation phase, we assess the suitability of the model for the defined problem or question. The model can also be validated against the given data and context.
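One common way to evaluate a classification model is to compare its predictions with known outcomes on data it has not seen. The Python sketch below computes accuracy, precision and recall; the labels are made up and the function name `evaluate` is hypothetical. Which metric matters most depends on the problem defined in the first phase.

```python
def evaluate(actual, predicted):
    """Compare predicted labels with known labels and return basic metrics."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a and p for a, p in pairs)
    false_pos = sum((not a) and p for a, p in pairs)
    false_neg = sum(a and (not p) for a, p in pairs)
    true_neg = sum((not a) and (not p) for a, p in pairs)

    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    return {
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Guard against division by zero when a class never occurs.
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }


# Hypothetical held-out labels and the model's predictions for them.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, False, False, True, False, True]

metrics = evaluate(actual, predicted)
print(metrics)
```

If the metrics are not good enough for the defined problem, that is usually the signal to go back to data preparation or model creation.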
Once the model has been validated, it can be deployed. This means that the findings can be shared with others, and the results can be used to make decisions or take action.
The above steps will be repeated for each question we would like to answer. The whole process is iterative and all the steps will often be revisited several times. Sometimes the chosen model may not bring the ideal results and the whole process can be terminated at a given stage.
Data Science Applications
The application areas of data science are wide and comprehensive. Some examples are listed below.
- Search engines use data science algorithms to deliver search results in less than a second
- Amazon uses data science to prepare a list of product suggestions based on previous search results and user experience
- Image recognition algorithms have been used by Facebook, WhatsApp and Google to detect human faces and enable search using images
- Speech recognition algorithms have been widely used in products like Google Voice, Siri and Cortana
- Fraud and risk detection systems have been improved with the help of data science
- Development of self-driving cars
Data science has broad scope across many areas such as finance, bioinformatics, supply chain optimization, and health and well-being. In this tip, we have learned about the basics of data science. In future tips, we will learn more practical ways to apply data science algorithms to common problems.
- Stay tuned to read the next tip on Data Science in MSSQLTips.com.
- Read and understand the differences between Data Science and other BI jargon here.
- Learn more about the data science process here.
Last Update: 2018-05-14