Any project starts with either a well-defined problem statement (such as forecasting next month's sales of an item in inventory, or finding the cause of customer churn) or a loosely defined one (such as how to increase sales of a product).
Data science enables us to solve such business problems through a series of well-defined steps. Generally, these are the steps we follow, and all the terminology of data science falls under one of them, as we will see shortly:
Step 1: Business understanding
Step 2: Collecting data
Step 3: Pre-processing data
Step 4: Analysing data
Step 5: Data Modelling
Step 6: Model Evaluation
Step 7: Model Deployment
Step 8: Driving insights and generating BI reports
Step 9: Taking a decision based on insights
Let us discuss these steps in detail:
The business need is the starting point of the life cycle. Hence it is important to understand the problem statement and to ask the customer the right questions, which helps us understand the data well and derive meaningful insights from it.
Despite all the technology that makes our lives easier, the success of any project still depends on the quality of the questions asked of the dataset.
Every domain and business works with a set of rules and goals. In order to acquire the correct data, we should be able to understand the business. Asking questions about the dataset will help in narrowing it down to correct data acquisition.
We typically use data science to answer five types of questions:
In this stage, you should also identify the central objective of your project by identifying the variables that need to be predicted.
Here are a few of the right questions that successful businesses have asked of their data science teams in the past:
All these questions are a necessary first step before we can embark on a data science journey. Having asked the correct questions, we move on to collecting data.
The first step in this phase is to identify the person who knows what data to acquire, and when, based on the question to be answered. This person need not be a data scientist; anyone who understands the real differences between the available data sets and can make hard-hitting decisions about the organization's data investment strategy is the right person for the job.
Data might need to be collected from multiple types of data sources.
A few examples of data sources:
Our first piece of terminology, BIG DATA, fits here. Big data is any data that is too big or complex to handle; it does not necessarily mean data that is large in size alone. Big data is characterized by four properties, the 4 V's, and if your data exhibits them, it qualifies as big data:
– Volume: Data in terabytes
– Velocity: Streaming data with high throughput
– Variety: Structured, semi-structured, and unstructured
– Veracity: quality of the data that is being analyzed
In a retail business, many transactions happen every second across many customers, and a lot of data is maintained in structured or unstructured formats about customers, employees, stores, sales, etc. All of this data put together is overly complex to process or even comprehend. Big data technologies like Hadoop, Spark, and Kafka simplify our work here.
This phase is often referred to as data wrangling. Data scientists often complain that it is the most boring and time-consuming task, involving the identification of various data quality issues.
In this step, we understand more about the data and prepare it for further analysis. The data understanding section of the data science methodology answers the question: Is the data that you collected representative of the problem to be solved?
This is one task that you will always end up doing. Cleaning data essentially means removing discrepancies from your data such as missing fields, improper values, setting the right format of the data, structuring data from raw files, etc.
Format the data into the desired structure and remove unwanted columns and features. Data preparation is the most time-consuming yet arguably the most important step in the entire life cycle; your model will only be as good as your data. This is similar to washing vegetables to remove surface chemicals. Data collection, data understanding, and data preparation together take up 70–90% of the overall project time.
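As a small illustration of what cleaning means in practice, the sketch below coerces raw string records to proper types and drops rows that cannot be parsed. The records and field names are hypothetical; in real projects a library such as pandas typically does this work.

```python
# Minimal data-cleaning sketch in plain Python (toy records).
raw_records = [
    {"customer": "A101", "age": "34", "spend": "120.50"},
    {"customer": "A102", "age": "",   "spend": "95.00"},   # missing age
    {"customer": "A103", "age": "29", "spend": "abc"},     # malformed spend
]

def clean(record):
    """Coerce fields to the right types; return None for unusable rows."""
    try:
        age = int(record["age"]) if record["age"] else None
        spend = float(record["spend"])
    except ValueError:
        return None  # drop rows whose numeric fields cannot be parsed
    return {"customer": record["customer"], "age": age, "spend": spend}

cleaned = [r for r in (clean(rec) for rec in raw_records) if r is not None]
print(len(cleaned))  # the malformed row is dropped
```

Missing values are kept as `None` here so a later step can decide whether to impute or drop them.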
This is also the point where, if you feel the data is not proper or sufficient to proceed, you go back to the data collection step.
EXPLORE… EXPLORE… EXPLORE
Exploratory analysis is often described as a philosophy, and there are no fixed rules for how you approach it. There are no shortcuts for data exploration.
Remember the quality of your inputs decides the quality of your output. Therefore, once you have got your business hypothesis ready, it makes sense to spend a lot of time and effort here.
To understand the data, many people look at summary statistics like the mean and median. People also plot the data and inspect its distribution through histograms, spectrum analysis, population distributions, etc.
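A quick look at such summary statistics needs nothing more than the standard library. The numbers below are illustrative; real exploration would use pandas and plotting libraries on the actual dataset.

```python
# Summary statistics with Python's standard library (toy monthly sales).
import statistics

monthly_sales = [120, 135, 128, 160, 158, 149, 170, 165, 180, 175, 190, 210]

print("mean:  ", round(statistics.mean(monthly_sales), 2))
print("median:", statistics.median(monthly_sales))
print("stdev: ", round(statistics.stdev(monthly_sales), 1))
```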
Now we create a plan for analytics on the data. Different types of analytics can be performed depending on the problem at hand:
We can use data aggregation methods and tools to provide insights into what happened in the past.
We can use statistical methods and other forecast techniques including data mining and machine learning to understand and estimate what could happen in the future.
We can use optimization and simulation methods to make decisions and describe possible outcomes for what-if analysis.
This step of the data science project lifecycle does not by itself produce meaningful insights. However, through regular data cleaning, data scientists can easily identify what foibles exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results.
So, we first determine which type of analytics we intend to perform; this is data analytics. After obtaining structured data from the cleaning operations (which is generally the case), we identify and discover hidden patterns and information in the large dataset; this is data mining.
For example, identifying seasonality in sales. Data analysis is the more holistic approach, while data mining focuses on finding the hidden patterns in the data. The discovered patterns are then fed into the data analysis approaches to generate hypotheses and find insights.
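The seasonality example can be sketched in a few lines: averaging sales per month surfaces which months stand out. The data below is made up for illustration; a real pipeline would aggregate from the transaction store.

```python
# Sketch: surfacing seasonality by averaging sales per month (toy data).
from collections import defaultdict

sales = [("Jan", 100), ("Jan", 110), ("Jul", 180), ("Jul", 190), ("Dec", 250)]

totals = defaultdict(list)
for month, amount in sales:
    totals[month].append(amount)

monthly_avg = {m: sum(v) / len(v) for m, v in totals.items()}
peak = max(monthly_avg, key=monthly_avg.get)
print(peak, monthly_avg[peak])  # the seasonal peak month
```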
This stage seems to be the most interesting one for almost all data scientists. Many people call it "the stage where the magic happens". But remember, magic can happen only if you have the correct props and technique. In data science, data is the prop and data preparation is the technique. So, before jumping to this step, make sure to spend a sufficient amount of time on the prior steps.
Modeling is used to find patterns or behaviors in data. These patterns help us in one of two ways:
Supervised learning is a technique in which we teach or train the machine using data, which is well labeled.
To understand supervised learning, let us consider an analogy. As kids, we all needed guidance to solve math problems. Our teachers helped us understand what addition is and how it is done. Similarly, you can think of supervised learning as a type of machine learning that involves a guide. The labeled data set is the teacher that trains you to understand patterns in the data; the labeled data set is nothing but the training data set.
The picture below shows supervised learning. By using labeled data, you are training the machine. In supervised learning, there is a well-defined training phase done with the help of labeled data.
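A tiny supervised-learning sketch: fitting a line y = a·x + b by least squares on labeled (x, y) pairs. The numbers are made up, and this is the closed-form math written out by hand; libraries such as scikit-learn provide production-grade versions.

```python
# Minimal supervised learning: least-squares line fit on labeled data.
xs = [1, 2, 3, 4, 5]             # feature, e.g. advertising spend
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # label, e.g. observed sales

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# slope and intercept from the closed-form least-squares solution
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def predict(x):
    return a * x + b

print(round(predict(6), 1))  # extrapolate to an unseen input
```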
A few examples of Supervised Algorithms:
Unsupervised learning involves training with unlabeled data and allowing the model to act on that information without guidance. Think of unsupervised learning as a smart kid who learns without any guidance.
A few examples of Unsupervised Algorithms:
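To make the contrast concrete, here is a minimal unsupervised sketch: 1-D k-means with two clusters and hand-picked starting centroids (both the data and the initial guesses are illustrative). No labels are given; the algorithm discovers the two groups on its own.

```python
# Minimal unsupervised learning: 1-D k-means with k=2 (toy data).
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]  # unlabeled observations
centroids = [0.0, 10.0]                   # illustrative starting guesses

for _ in range(10):  # a few refinement iterations suffice here
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centroids[c]))
        clusters[nearest].append(p)
    new_centroids = []
    for i in (0, 1):
        members = clusters[i]
        new_centroids.append(sum(members) / len(members) if members
                             else centroids[i])
    centroids = new_centroids

print(sorted(round(c, 1) for c in centroids))  # the two discovered centers
```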
Below are some of the standard practices involved to understand, clean, and prepare your data for building your predictive model:
Finally, we will need to iterate over steps 4–7 multiple times before we come up with our refined model.
A common question professionals have when evaluating a machine learning model is which dataset they should use to measure its performance. Looking at performance metrics on the training dataset is helpful but not always right, because the numbers obtained might be overly optimistic: the model has already adapted to the training data. Model performance should be measured and compared using validation and test sets to identify the best model based on accuracy and over-fitting.
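Carving out those validation and test sets can be sketched as below. The 70/15/15 ratio is a common but illustrative choice, and the fixed seed just makes the split reproducible.

```python
# Sketch of a train/validation/test split (illustrative 70/15/15 ratio).
import random

data = list(range(100))  # stand-in for 100 labeled examples
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(data)

n = len(data)
train = data[: int(0.70 * n)]
val = data[int(0.70 * n): int(0.85 * n)]
test = data[int(0.85 * n):]

print(len(train), len(val), len(test))
```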
Models are selected based on the business problem. It is essential to identify the task: is it a classification problem, a regression or prediction problem, time series forecasting, or a clustering problem? Once the problem type is sorted out, the model can be implemented.
A few examples of Classification metrics:
A few examples of Regression metrics:
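Such metrics are just arithmetic over predicted and actual values; the sketch below computes one common classification metric (accuracy) and one regression metric (RMSE) by hand on made-up labels. Metric libraries do this for you in practice.

```python
# Hand-computed accuracy (classification) and RMSE (regression) on toy data.
import math

# classification: predicted vs. actual class labels
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
accuracy = (sum(t == p for t, p in zip(y_true_cls, y_pred_cls))
            / len(y_true_cls))

# regression: predicted vs. actual numeric values
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.5, 2.0]
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true_reg, y_pred_reg))
                 / len(y_true_reg))

print(accuracy, round(rmse, 3))
```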
The model should be robust, not overfitted. An overfitted model will not predict future data accurately.
Technical skills alone are not sufficient in this process. One essential skill you need is the ability to tell a clear and actionable story. If your presentation does not trigger action in your audience, your communication was not effective. The presentation should be in line with the business questions and meaningful to the organization and its stakeholders. Remember that you will often be presenting to an audience with no technical background, so the way you communicate the message is key.
A few tools used for visualization:
After models are built, they are first deployed in a pre-production or test environment before being deployed into production.
Whatever shape or form your data model is deployed in, it must be exposed to the real world. Once real humans use it, you are bound to get feedback, and capturing this feedback translates directly into the life or death of any project.
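At its simplest, exposing a model means putting a prediction function behind an HTTP endpoint. The sketch below uses only the standard library, and both the model (a fixed linear score) and the route are hypothetical; real deployments use the frameworks and cloud providers listed below.

```python
# Minimal sketch of serving a model over HTTP with the standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model: a fixed linear score."""
    return 2.0 * features.get("x", 0.0) + 1.0

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the JSON request body and return the model's prediction
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        result = {"prediction": predict(json.loads(body))}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

# to serve: HTTPServer(("localhost", 8000), PredictHandler).serve_forever()
```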
A few frameworks used for Model deployment:
Popular and widely used cloud providers are,
Actionable insights from the model show how data science powers predictive and prescriptive analytics. These give us the ability to learn how to repeat positive results and how to prevent negative ones.
Based on all the insights gathered, whether through observation of the data or through the machine learning model's results, we reach a state where we can make decisions about the business problem at hand.
A few examples:
Each step has its own importance, and the project will go through multiple iterations back and forth. People from different technical stacks will work in coordination to produce a successful deliverable.
Hence, last but not least, communication among the multiple teams is essential for smoother completion of the project.