Machine Learning Introduction

In the name of Allah, most gracious and most merciful,

In this post, I will talk about the job role of the machine learning engineer and the machine learning pipeline. If you want to see the big picture first, you can refer to this Machine Learning Post.

1. Important Distinctions (What are the major roles of a Machine Learning Engineer)

First of all, I want to give a quick overview of the roles in machine learning. In this field, you will mostly hear about the following job roles:

1.1 Data Engineer

The data engineer is responsible for data acquisition: getting data from different sources and putting it in a suitable format for the data scientist and the machine learning engineer to work on. They could also be responsible for data streaming. Of course, there are many more details in this job, but I am just giving a quick, simple overview.

You could think of the key sentence here as “Obtaining raw data and transforming it into a suitable format”.

1.2 Data Scientist

Using the data provided by the data engineer in suitable formats, the data scientist cleans the data and analyzes it in order to understand it. They use descriptive statistics to understand the data and the trends in it, and to extract business insights from it. Finally, they present these insights to, for instance, a company’s decision-makers so that they can better develop and improve their products or services.

The data scientist could do some of the machine learning engineer’s work, but not as much as the machine learning engineer does. So there is some overlap, but each of them still has their own core skills. For example, it would be a major problem if a data scientist didn’t know how to analyze certain data. They could also build interactive dashboards and use storytelling to better present their findings to the decision-makers.

The key sentence here is “Getting Insights to Present to Others”.

1.3 Machine Learning Engineer

The machine learning engineer is also responsible for analyzing and understanding data, but with the purpose of extracting features from it. These features are then provided as inputs to a machine learning model so that the model can detect patterns in the data. Once it is trained, you give it similar inputs and it predicts outputs. In other words, your model learns from solved examples to solve similar problems. In addition to this, machine learning models can segment data, that is, divide it into clusters that share similar features or attributes.

This trained model could then be deployed to a desktop application, web application or a mobile application for instance.

The major responsibilities of the machine learning engineer are:

  1. Feature Engineering: How to extract suitable features from data. This step is crucial, and it is what separates great machine learning engineers from others. They could even generate new features by understanding their data and seeing that a new feature would work better for the machine learning model they will use.
  2. Modeling: Choosing the machine learning model that best fits the data. Of course, this involves understanding when to use which model according to the data and the field or domain, and how to tune the model’s hyperparameters so that it better captures the trends in the data. Candidate models are compared using evaluation metrics such as accuracy, precision, recall, F1-score, R² score, and so on.

Depending on the company you are working for, these roles could intersect. For example, in a startup company you could do the data engineer work because they don’t have a data engineer, and so on.

2. Machine Learning Pipeline

Usually, machine learning involves the following steps:

2.1 Understanding the Business Problem

Domain knowledge is very important. You need to understand the problem from a business perspective. What do you want to do? For what purpose? How can it help your users? How critical is your application? Does it affect human lives, like cancer detection, or is it less critical, like a recommendation system? The answers will even help you choose your evaluation metrics downstream.

2.2 Problem Formalization

Define the problem at hand. If you are building a spam detector, for example, ask: what is the definition of spam? What exactly do you want to predict?

2.3 Data Collection

The data should be given to you by the data engineer. If not, work out how you would collect it yourself. In competitions, the data is provided for you, and some competitions allow you to add external data under certain restrictions defined by the competition’s rules.

2.4 Exploratory Data Analysis (EDA) and Data Preprocessing

This is not an easy step. It could take the majority of your time, because this is where you understand your data, extract features from it, and check for missing data and outliers and decide how to handle them appropriately. If this step is done well, you could find that there is no need for a complex model afterwards, since a simple model could do the job better than a complex one.
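As a minimal sketch of checking for missing data and outliers, the snippet below fills missing values with the median and flags outliers with the common 1.5 × IQR rule. The `ages` list, the function names, and the choice of these two techniques are assumptions made for illustration; the right handling always depends on your data.

```python
from statistics import median, quantiles


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical column of ages with two missing entries and one suspicious value.
ages = [23, 25, None, 31, 29, 27, 120, None, 26]

cleaned = fill_missing_with_median(ages)
print(cleaned)                # missing entries replaced by the median (27)
print(iqr_outliers(cleaned))  # [120]
```

Whether a flagged value such as 120 is an error or a genuine observation is a question for your domain knowledge, not for the code.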

What is a feature?

I have mentioned the word feature many times, but it could be vague for people coming from outside the field. In simple words, a feature is what you want your model to see.

Let me give you some intuition. If you look at a person’s face, how do you recognize whether they are happy, angry, or sad? You detect that from specific parts of the face, such as the eyebrows, the look in the eyes, the expression of the mouth, and so on. How does an experienced doctor know a patient’s disease just from looking at their face or by asking specific questions? Again, the doctor detects some features on the patient’s body and face, and the questions reveal more features about the patient, so that the doctor can predict the illness. Another simple example: how do you know the brand of a laptop or a mobile phone? You mostly look at the logo, and that’s it. This is a very simple feature.

Features are the information that an expert needs to make a prediction in their domain of expertise. Features are the golden clues that reveal what is required to make an accurate prediction. That’s why feature engineering is so important. Remove noise from your data and give your machine learning model the features with the highest predictive power.

Feature Intuition

Simply put: garbage in, garbage out. If you don’t do this step well and just give your machine learning model irrelevant features, or features that don’t have any predictive power, then don’t expect good predictions, no matter how sophisticated or complex your model is. You are giving the model noise.

New features can also be created from the existing input features: you combine several of them to form a new feature with higher predictive power.
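As a small illustration of combining features into a new one, the sketch below derives a body mass index from height and weight. The field names `height_cm`, `weight_kg`, and `bmi`, and the function `add_bmi`, are hypothetical and chosen only for this example.

```python
def add_bmi(record):
    """Return a copy of the record with a derived 'bmi' feature (kg / m^2)."""
    height_m = record["height_cm"] / 100
    enriched = dict(record)  # leave the original record untouched
    enriched["bmi"] = round(record["weight_kg"] / height_m ** 2, 1)
    return enriched


# Hypothetical patient record with two raw features.
patient = {"height_cm": 180, "weight_kg": 81}
print(add_bmi(patient))
```

Neither height nor weight alone says much about a patient, but their combination may be a stronger clue for some medical predictions. That is the idea behind a derived feature.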

2.5 Modeling

Now, given the features you have, what is the best model that could learn patterns from them? It depends on the relationships in your data.

If your features are linearly related, you could go for linear models. If non-linearity exists, you could use non-linear models. And if no clear relationship appears in your data, you could go for tree-based models or ensemble models.

If you want to predict a continuous number, like the temperature, you use regression models. If you want to predict a category, you use classification models.
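For the regression case with a linear relationship, the simplest model is a straight line fitted by ordinary least squares. The sketch below is a hand-written version for one input feature; the `hours` and `temps` values are made up, and in practice you would normally rely on a library rather than this helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: temperature measured at successive hours.
hours = [1, 2, 3, 4, 5]
temps = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(hours, temps)


def predict(hour):
    """Predict a continuous value (the temperature) for a new input."""
    return slope * hour + intercept


print(slope, intercept)  # 2.0 10.0
print(predict(6))        # 22.0
```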

Models have hyperparameters, which work like the buttons on a remote control: they are settings you choose that control how the model learns. Finding good values for them is called hyperparameter tuning.

Therefore, to get the best predictive model, you need both to choose a suitable model and to tune its hyperparameters.
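In its simplest form, hyperparameter tuning means trying several values and keeping the one that scores best on data the model has not been trained on. The sketch below does this for a tiny hand-written k-nearest-neighbours classifier, where `k` is the hyperparameter. The data, the labels, and the function names are all invented for illustration, and the training set deliberately contains one noisy label.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


def tune_k(train, validation, candidates):
    """Return the candidate k with the highest validation accuracy."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))


# Hypothetical 1-D data: (value, label). The point (3, "high") is label noise.
train = [(1, "low"), (2, "low"), (3, "high"), (4, "low"),
         (6, "high"), (7, "high"), (8, "high"), (9, "high")]
validation = [(2.9, "low"), (3.2, "low"), (6.5, "high"), (1.5, "low")]

for k in (1, 3, 7):
    print(k, accuracy(train, validation, k))
print("best k:", tune_k(train, validation, [1, 3, 7]))  # best k: 3
```

Here k = 1 follows the noisy point too closely and k = 7 smooths too much, so the middle setting wins on this toy data.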

2.6 Model Evaluation

As I mentioned previously, there are different evaluation metrics that tell you how well your model is performing. Which metric you choose depends on how critical your domain is, whether it is a classification or a regression problem, and other factors.

For example, suppose you built a spam detector. How many emails that are truly spam were detected as spam? How many of them were missed? And how many non-spam emails were wrongly classified as spam?
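These questions map onto the counts of a confusion matrix, from which precision, recall, and F1-score are computed. The sketch below works through a made-up set of ten emails; the labels `"spam"` and `"ham"` and the function names are assumptions for this example.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # spam correctly detected
        elif p == positive:
            fp += 1  # non-spam wrongly flagged as spam
        elif a == positive:
            fn += 1  # spam that was missed
        else:
            tn += 1  # non-spam correctly let through
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and model output for ten emails.
actual = ["spam"] * 4 + ["ham"] * 6
predicted = ["spam", "spam", "spam", "ham", "spam"] + ["ham"] * 5

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)                   # 3 1 1 5
print(precision_recall_f1(tp, fp, fn))  # (0.75, 0.75, 0.75)
```

Recall answers the first question (how much of the real spam was caught), while precision reflects the last one (how often a "spam" verdict was right). Which of the two matters more depends on the cost of each kind of mistake in your domain.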

2.7 Model Deployment

If everything was fine, you could now deploy your model to a desktop application, web application, or mobile application so that real users could use it.

2.8 Non-linear Iteration

What if, after you deploy your model, the world changes a little, users follow new trends, and new data arrives? Your model then needs to be updated to reflect the new user trends. You could even discover something new about the business problem, or get feedback from users that helps you identify more powerful features, and so on.

In addition to this, you could revisit any of the previous steps in a non-linear way. You could go to step 3, then to step 4, then discover something you missed in step 2 and go back to step 2, and so on. We are humans, and mistakes are expected from us, so you could return to any of the previous steps because you found a mistake there or came up with a better idea for some of its details.


Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needs more clarification or has seen something wrong in what I have written, so that I can correct it, and I would also appreciate any enhancements or suggestions. We are humans, and errors are expected from us, but we can minimize those errors by learning from our mistakes and by seeking to improve what we do.

Allah bless our master Muhammad and his family.

