Any project starts with either a well-defined problem statement (such as forecasting next month's sales of an item in inventory, or finding the cause of customer churn) or a loosely defined one (such as how to increase sales of a product).
Data science enables us to solve such business problems through a series of well-defined steps. Generally, these are the steps we follow, and all the terminology of data science falls under one of them, as we will see shortly:
Step 1: Business understanding
Step 2: Collecting data
Step 3: Pre-processing data
Step 4: Analysing data
Step 5: Data Modelling
Step 6: Model Evaluation
Step 7: Model Deployment
Step 8: Driving insights and generating BI reports
Step 9: Taking a decision based on insights
Let us discuss these steps in detail:
The business need is the starting point of the life cycle. Hence it is important to understand the problem statement and to ask the customer the right questions, which helps us understand the data well and derive meaningful insights from it.
Despite all the technology that makes our lives easier, the success of any project still depends on the quality of the questions asked of the dataset.
Every domain and business works with a set of rules and goals. In order to acquire the correct data, we should be able to understand the business. Asking questions about the dataset will help in narrowing it down to correct data acquisition.
We typically use data science to answer five types of questions:
In this stage, you should also identify the central objective of your project by identifying the variables that need to be predicted.
Here are a few of the right questions that successful businesses have asked of their data science teams in the past:
All these questions are a necessary first step before we can embark on a data science journey. Having asked the correct questions, we move on to collecting data.
The first step in this phase is to identify the person who knows what data to acquire, and when, based on the question to be answered. This person need not be a data scientist; anyone who understands the real differences between the available data sets and can make hard-hitting decisions about the organization's data investment strategy is the right person for the job.
Data might need to be collected from multiple types of data sources.
A few examples of data sources:
Our first piece of terminology, BIG DATA, fits here. Big data is any data that is too big or complex to handle; it does not necessarily mean data that is large in size alone. Big data is characterized by four properties, the 4 V's, and if your data exhibits them, it qualifies as big data:
– Volume: Data in terabytes
– Velocity: Streaming data with high throughput
– Variety: Structured, semi-structured, and unstructured
– Veracity: quality of the data that is being analyzed
In a retail business, many transactions happen every second across many customers, and a lot of data is maintained in structured or unstructured formats about customers, employees, stores, sales, etc. All of this data put together is overly complex to process or even comprehend. Big data technologies like Hadoop, Spark, and Kafka simplify our work here.
This phase is often referred to as data wrangling. Data scientists often complain that it is the most boring and time-consuming task, involving the identification of various data quality issues.
In this step, we understand more about the data and prepare it for further analysis. The data understanding section of the data science methodology answers the question: Is the data that you collected representative of the problem to be solved?
This is one task that you will always end up doing. Cleaning data essentially means removing discrepancies from your data such as missing fields, improper values, setting the right format of the data, structuring data from raw files, etc.
Format the data into the desired structure and remove unwanted columns and features. Data preparation is the most time-consuming yet arguably the most important step in the entire life cycle; your model will only be as good as your data. This is similar to washing vegetables to remove surface chemicals. Data collection, data understanding, and data preparation together take up 70–90% of the overall project time.
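As a small illustration of what cleaning means in practice, the sketch below coerces raw string records to proper types and drops rows that cannot be parsed. The records and field names are hypothetical; in real projects a library such as pandas typically does this work.

```python
# Minimal data-cleaning sketch in plain Python (toy records).
raw_records = [
    {"customer": "A101", "age": "34", "spend": "120.50"},
    {"customer": "A102", "age": "",   "spend": "95.00"},   # missing age
    {"customer": "A103", "age": "29", "spend": "abc"},     # malformed spend
]

def clean(record):
    """Coerce fields to the right types; return None for unusable rows."""
    try:
        age = int(record["age"]) if record["age"] else None
        spend = float(record["spend"])
    except ValueError:
        return None  # drop rows whose numeric fields cannot be parsed
    return {"customer": record["customer"], "age": age, "spend": spend}

cleaned = [r for r in (clean(rec) for rec in raw_records) if r is not None]
print(len(cleaned))  # the malformed row is dropped
```

Missing values are kept as `None` here so a later step can decide whether to impute or drop them.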
This is also the point where, if you feel the data is not proper or sufficient to proceed, you go back to the data collection step.
EXPLORE… EXPLORE… EXPLORE
Exploratory analysis is often described as a philosophy, and there are no fixed rules for how you approach it. There are no shortcuts for data exploration.
Remember the quality of your inputs decides the quality of your output. Therefore, once you have got your business hypothesis ready, it makes sense to spend a lot of time and effort here.
To understand the data, many people look at summary statistics like the mean and median. People also plot the data and inspect its distribution through histograms, spectrum analysis, population distributions, etc.
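A quick look at such summary statistics needs nothing more than the standard library. The numbers below are illustrative; real exploration would use pandas and plotting libraries on the actual dataset.

```python
# Summary statistics with Python's standard library (toy monthly sales).
import statistics

monthly_sales = [120, 135, 128, 160, 158, 149, 170, 165, 180, 175, 190, 210]

print("mean:  ", round(statistics.mean(monthly_sales), 2))
print("median:", statistics.median(monthly_sales))
print("stdev: ", round(statistics.stdev(monthly_sales), 1))
```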
Now we create a plan for analytics on the data. Different types of analytics can be performed depending on the problem at hand:
We can use data aggregation methods and tools to provide insights into what happened in the past.
We can use statistical methods and other forecast techniques including data mining and machine learning to understand and estimate what could happen in the future.
We can use optimization and simulation methods to make decisions and describe possible outcomes for what-if analysis.
This step of the data science project lifecycle does not by itself produce meaningful insights. However, through regular data cleaning, data scientists can easily identify what foibles exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results.
So, we first determine which type of analytics we intend to perform; this is data analytics. After obtaining structured data from the cleaning operations (which is generally the case), we identify and discover hidden patterns and information in the large dataset; this is data mining.
For example, identifying seasonality in sales. Data analysis is the more holistic approach, while data mining focuses on finding the hidden patterns in the data. The discovered patterns are then fed into the data analysis approaches to generate hypotheses and find insights.
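The seasonality example can be sketched in a few lines: averaging sales per month surfaces which months stand out. The data below is made up for illustration; a real pipeline would aggregate from the transaction store.

```python
# Sketch: surfacing seasonality by averaging sales per month (toy data).
from collections import defaultdict

sales = [("Jan", 100), ("Jan", 110), ("Jul", 180), ("Jul", 190), ("Dec", 250)]

totals = defaultdict(list)
for month, amount in sales:
    totals[month].append(amount)

monthly_avg = {m: sum(v) / len(v) for m, v in totals.items()}
peak = max(monthly_avg, key=monthly_avg.get)
print(peak, monthly_avg[peak])  # the seasonal peak month
```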
This stage seems to be the most interesting one for almost all data scientists. Many people call it "the stage where the magic happens". But remember, magic can happen only if you have the correct props and technique. In data science, data is the prop and data preparation is the technique. So, before jumping to this step, make sure to spend a sufficient amount of time on the prior steps.
Modeling is used to find patterns or behaviors in data. These patterns help us in one of two ways:
Supervised learning is a technique in which we teach or train the machine using data, which is well labeled.
To understand supervised learning, let us consider an analogy. As kids, we all needed guidance to solve math problems. Our teachers helped us understand what addition is and how it is done. Similarly, you can think of supervised learning as a type of machine learning that involves a guide. The labeled data set is the teacher that trains you to understand patterns in the data; the labeled data set is nothing but the training data set.
The picture below shows supervised learning. By using labeled data, you are training the machine. In supervised learning, there is a well-defined training phase done with the help of labeled data.
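A tiny supervised-learning sketch: fitting a line y = a·x + b by least squares on labeled (x, y) pairs. The numbers are made up, and this is the closed-form math written out by hand; libraries such as scikit-learn provide production-grade versions.

```python
# Minimal supervised learning: least-squares line fit on labeled data.
xs = [1, 2, 3, 4, 5]             # feature, e.g. advertising spend
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # label, e.g. observed sales

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# slope and intercept from the closed-form least-squares solution
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def predict(x):
    return a * x + b

print(round(predict(6), 1))  # extrapolate to an unseen input
```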
A few examples of Supervised Algorithms:
Unsupervised learning involves training with unlabeled data and allowing the model to act on that information without guidance. Think of unsupervised learning as a smart kid who learns without any guidance.
A few examples of Unsupervised Algorithms:
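To make the contrast concrete, here is a minimal unsupervised sketch: 1-D k-means with two clusters and hand-picked starting centroids (both the data and the initial guesses are illustrative). No labels are given; the algorithm discovers the two groups on its own.

```python
# Minimal unsupervised learning: 1-D k-means with k=2 (toy data).
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]  # unlabeled observations
centroids = [0.0, 10.0]                   # illustrative starting guesses

for _ in range(10):  # a few refinement iterations suffice here
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centroids[c]))
        clusters[nearest].append(p)
    new_centroids = []
    for i in (0, 1):
        members = clusters[i]
        new_centroids.append(sum(members) / len(members) if members
                             else centroids[i])
    centroids = new_centroids

print(sorted(round(c, 1) for c in centroids))  # the two discovered centers
```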
Below are some of the standard practices involved to understand, clean, and prepare your data for building your predictive model:
Finally, we will need to iterate over steps 4–7 multiple times before we come up with our refined model.
A common question professionals have when evaluating a machine learning model is which dataset they should use to measure its performance. Looking at performance metrics on the training dataset is helpful but not always right, because the numbers obtained might be overly optimistic: the model has already adapted to the training data. Model performance should be measured and compared using validation and test sets to identify the best model based on accuracy and over-fitting.
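Carving out those validation and test sets can be sketched as below. The 70/15/15 ratio is a common but illustrative choice, and the fixed seed just makes the split reproducible.

```python
# Sketch of a train/validation/test split (illustrative 70/15/15 ratio).
import random

data = list(range(100))  # stand-in for 100 labeled examples
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(data)

n = len(data)
train = data[: int(0.70 * n)]
val = data[int(0.70 * n): int(0.85 * n)]
test = data[int(0.85 * n):]

print(len(train), len(val), len(test))
```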
Models are selected based on the business problem. It is essential to identify the task: is it a classification problem, a regression or prediction problem, time series forecasting, or a clustering problem? Once the problem type is sorted out, the model can be implemented.
A few examples of Classification metrics:
A few examples of Regression metrics:
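Such metrics are just arithmetic over predicted and actual values; the sketch below computes one common classification metric (accuracy) and one regression metric (RMSE) by hand on made-up labels. Metric libraries do this for you in practice.

```python
# Hand-computed accuracy (classification) and RMSE (regression) on toy data.
import math

# classification: predicted vs. actual class labels
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
accuracy = (sum(t == p for t, p in zip(y_true_cls, y_pred_cls))
            / len(y_true_cls))

# regression: predicted vs. actual numeric values
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.5, 2.0]
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true_reg, y_pred_reg))
                 / len(y_true_reg))

print(accuracy, round(rmse, 3))
```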
The model should be robust, not overfitted. An overfitted model will not predict future data accurately.
Technical skills alone are not sufficient in this process. One essential skill you need is the ability to tell a clear and actionable story. If your presentation does not trigger action in your audience, your communication was not effective. The presentation should be in line with the business questions and meaningful to the organization and its stakeholders. Remember that you will often be presenting to an audience with no technical background, so the way you communicate the message is key.
A few tools used for visualization:
After models are built, they are first deployed in a pre-production or test environment before being deployed into production.
Whatever shape or form your data model is deployed in, it must be exposed to the real world. Once real humans use it, you are bound to get feedback, and capturing this feedback translates directly into the life or death of any project.
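At its simplest, exposing a model means putting a prediction function behind an HTTP endpoint. The sketch below uses only the standard library, and both the model (a fixed linear score) and the route are hypothetical; real deployments use the frameworks and cloud providers listed below.

```python
# Minimal sketch of serving a model over HTTP with the standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model: a fixed linear score."""
    return 2.0 * features.get("x", 0.0) + 1.0

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the JSON request body and return the model's prediction
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        result = {"prediction": predict(json.loads(body))}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

# to serve: HTTPServer(("localhost", 8000), PredictHandler).serve_forever()
```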
A few frameworks used for Model deployment:
Popular and widely used cloud providers are,
Actionable insights from the model show how data science powers predictive and prescriptive analytics. These give us the ability to learn how to repeat positive results and how to prevent negative ones.
Based on all the insights gathered, whether through observation of the data or through the machine learning model's results, we reach a state where we can make decisions about the business problem at hand.
A few examples:
Each step has its own importance, and the project will go through multiple iterations back and forth. People from different technical stacks will work in coordination to produce a successful deliverable.
Hence, last but not least, communication among the multiple teams is essential for smoother completion of the project.