Why Machine Learning? A Glimpse of Its Big Picture

In the name of Allah, most gracious and most merciful,

1. Engineering and Mathematics

Machine learning has transformed many industries, and we observe its effects in our daily lives. That is because of the way machine learning solves problems. Before machine learning, we had to figure out the physical laws governing complex phenomena, which are usually difficult to describe in their nitty-gritty details. Therefore, we simplify things by making assumptions so that we can at least come up with something close to reality that describes the phenomenon we are seeing. At the end of the day, we want to solve our problems.

Let me elaborate on this before delving into what I want to say about machine learning, because I see that understanding this point is crucial. What is engineering? It is solving problems. But how can I solve problems? Well, by describing them precisely so that we understand them on a deeper level, and then using the tools we have to fix them. Yes, I understand, but what exactly do you mean by describing a problem precisely? I mean describing it mathematically, so that we can draw on our pool of mathematical tools to solve it. In other words, mathematics is the tool we use to describe our problems and represent them in ways that may reveal even greater insights; then, using mathematics again, we can find solutions to our problems and also optimize them.

But how on earth can we describe things mathematically? Well, usually many things in life can be viewed as entities and relationships between entities. For example, a car is an entity composed of many component entities such as the gear system, the braking system, the lighting system, and so on. These entities are interconnected through input-output relationships, so the output(s) of one or more entities can be fed as input(s) to one or more other entities. In the world of mathematics, each of these entities can be described as a mathematical model (i.e., a mathematical equation). This model may take more than one input and produce more than one output. We have therefore converted many things in life into mathematical models, which usually rest on some assumptions that keep the model's complexity acceptable. Why assumptions? Again, because reality is too difficult to describe exactly, so we make reasonable assumptions to simplify the mathematical model enough to use it in our engineering applications.

2. Put on Your Engineering Eyes

Now that I have explained these ideas, let's put on our engineering eyes and see the physical world differently. Under the hood, the physical world consists of simplified mathematical models ("entities"), with input-output relationships between them. To accommodate a given model's nature, we can apply acceptable transformations to the inputs and outputs so that the model receives what it expects as input and therefore produces reasonable outputs.

3. The Machine Learning Revolution

But what is the usefulness of all this? Why mention these many details and complexities? Because I want you to understand how machine learning works and the beauty of it. Machine learning is about using data to generate these models instead of purely deriving them from the governing physical laws and equations. For example, in supervised learning (one type of machine learning), the promise is: give me enough inputs and outputs of your model and, guess what, I will try to figure out an approximate mathematical model for you! Do you see? It is another transformative, great way of solving problems, one that needed three very important things:

  1. Data (since we are living in the age of big data, data is everywhere)
  2. High computational power to process these huge amounts of data (computers keep becoming more powerful and cheaper; look up Moore's law)
  3. Mathematical tools: algorithms with adjustable parameters that we fit to the data, optimization tools that ensure the algorithms learn from the data properly, and analytical tools that tell us how far we are from a more accurate mathematical model

In other words, part of the complexity of discovering the entities’ mathematical governing equations is reduced by relying on high-quality data.

But beware: that doesn't mean we no longer want to discover the entities' governing equations from physical laws. Machine learning and even deep learning models often have inaccuracies that can be mitigated by complementing them with rules and heuristics drawn from domain knowledge or from the governing equations. Therefore, domain experts still have their place, but their job has changed somewhat: instead of having to do everything alone, they feed their domain knowledge into the machine so that the machine can do great work with the help of both the data and the experts. In addition, machine learning and deep learning have their own difficulties, and they are not necessarily the solution for every problem. There are tradeoffs one should evaluate to arrive at the most effective and efficient model. Sometimes achieving 95% accuracy in a short time is better than achieving 99% over a long time.

In summary, machine learning has revolutionized the way we solve problems, especially very difficult problems with so many details that they are hard to solve using hand-written rules or by deriving the exact governing equations. For example, we have seen great recent advancements in Natural Language Processing and Computer Vision thanks to the unique and smart ways machine learning and deep learning solve problems.

4. What to Learn from Data

Machine learning could be divided into two approaches:

  1. Instance-based learning: the system learns by using a similarity measure to gauge how similar a new, unseen data point is to the data points it was trained on. For example, it could classify a new data point with the same class as an old data point if the similarity distance is within a certain threshold.
  2. Model-based learning: the system learns specific parameters from the training data and uses them to make predictions on new, unseen test data. So it learns parameters rather than just measuring how similar one thing is to another, as in instance-based learning.
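The contrast between the two approaches can be sketched on toy data. This is a minimal illustration, not a production implementation: the instance-based side is a 1-nearest-neighbor lookup, and the model-based side learns a single parameter (a decision threshold); all data and function names here are made up for the example.

```python
# Toy 1-D, two-class data: class 0 clusters near small values, class 1 near large ones.
train_x = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
train_y = [0, 0, 0, 1, 1, 1]

def predict_instance_based(x):
    """Instance-based: copy the label of the most similar (nearest) training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def fit_model_based():
    """Model-based: learn one parameter, the midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / 3
    mean1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / 3
    return (mean0 + mean1) / 2  # the learned threshold parameter

threshold = fit_model_based()   # 5.5 for this data

def predict_model_based(x):
    """Predictions now come from the learned parameter, not the stored examples."""
    return 1 if x > threshold else 0

print(predict_instance_based(2.5))  # nearest stored points are class 0 -> 0
print(predict_model_based(8.5))     # above the learned threshold -> 1
```

Notice that the instance-based predictor must keep the training data around at prediction time, while the model-based predictor only needs the learned `threshold`.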

5. How to Learn from Data

5.1 How to Split Data

Split the data into training and testing sets. You learn from the training data and then evaluate the "generalization error" (the "out-of-sample" error) on the test data. But wait, that's not the whole story. There is another set called the validation set, which is carved out of the training set, so the training set is further reduced to (reduced training set + validation set). The validation set is used to tune the model's hyperparameters, which are not learned by the fitting procedure and so are set manually. The model's parameters, in contrast, are learned by fitting the model to the data. So there is a difference between the model's automatically fitted parameters and its manually set hyperparameters.
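The three-way split described above can be sketched in a few lines. This is a hedged example on a stand-in dataset; the 60/20/20 proportions are an illustrative choice, not a rule.

```python
import random

# Stand-in for 100 (input, output) examples.
data = list(range(100))

random.seed(42)        # fixed seed so the split is reproducible
random.shuffle(data)   # shuffle before splitting to avoid ordering bias

test_set       = data[:20]    # held out to estimate the generalization error
validation_set = data[20:40]  # used to tune hyperparameters
training_set   = data[40:]    # the reduced training set, used to fit parameters

print(len(training_set), len(validation_set), len(test_set))  # 60 20 20
```

The key point: hyperparameters are chosen by comparing candidate models on the validation set, and only the final chosen model is evaluated once on the test set.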

It is very important to know that the testing and validation sets must be as representative as possible of the real-life data, or the data in production, since they are the sets on which you evaluate the model's performance. If your evaluation method is flawed, the whole effort collapses: how would you know whether your model is good or bad without proper evaluation?

5.2 Further Split

Again, this is not the whole story of splitting; let me elaborate. Assume we have a lot of data that is non-representative of the production data, but also a small amount of representative data. What can we do then? Given that the testing and validation sets must be as close as possible to the production data, we can split the small representative data 50%-50% between the testing and validation sets. Then we train on the (non-representative) training set, evaluate, and see.

If the validation set results are disappointing, the problem could be either that you are overfitting the training data or that the data mismatch (i.e., non-representativeness) is at fault. How can you know which is the exact problem?

We want to isolate the mismatch problem, so let us look only at the non-representative data we have. As Andrew Ng suggests, we can further split the reduced training set into an even smaller training set and a train-dev set. Train on the smaller training set and check performance on the train-dev set. If it is poor, you have overfitted the non-representative training data, so try simplifying the model, regularizing it, or getting more training data. Otherwise, if the train-dev performance is good, your model is not overfitting; it generalizes well to the unseen train-dev set. Therefore, the poor results on the original validation set are likely due to data mismatch (i.e., non-representativeness), so you could look into getting more representative data, for instance.
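The overall layout of the sets can be sketched as follows. This is only an organizational sketch: the variable names and the 10% train-dev size are illustrative assumptions, and the decision logic is summarized in comments rather than implemented.

```python
# Hypothetical: 1000 non-representative training examples.
reduced_training_set = list(range(1000))

# Carve out a train-dev set from the SAME distribution as the training data.
train_dev_set = reduced_training_set[:100]                 # 10%, illustrative
further_reduced_training_set = reduced_training_set[100:]  # train on this

# Diagnosis after training on further_reduced_training_set:
# - poor train-dev performance  -> overfitting (simplify, regularize, more data)
# - good train-dev but poor validation performance -> data mismatch
print(len(further_reduced_training_set), len(train_dev_set))  # 900 100
```

The train-dev set isolates the two failure modes because it shares the training distribution: overfitting shows up there, while data mismatch only shows up on the representative validation set.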

5.3 How Much Data to Use

  1. Batch Learning (Offline Learning): the system learns on all the data at once, which generally takes a lot of time and computational resources, so it is usually done offline. The system is trained, deployed to production, and runs without any further learning; it just applies what it learned.
  2. Online Learning: the system learns on batches of data incrementally, batch by batch. On fitting each batch, it updates the model's parameters, taking small steps to fit the model bit by bit. Each learning step is fast and cheap, so the system can learn from new data on the fly as it arrives.

You can do something similar to online learning when you have so much data that it cannot fit in one machine's main memory (hence "out-of-core" learning), but this is usually done offline, so the term "online" can be confusing here. It is better thought of as incremental learning.
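A toy sketch of the incremental idea: here the "model" is a single parameter (a running mean) updated batch by batch, so the full dataset never has to sit in memory at once. The batch size and data are made up for illustration; a real incremental learner would update many parameters per step in the same spirit.

```python
def batches(stream, size):
    """Yield the data in small chunks, as an out-of-core loader would."""
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

data = list(range(1, 101))  # pretend this arrives as a stream of 100 values
mean, count = 0.0, 0

for batch in batches(data, size=10):  # each step is fast and cheap
    for x in batch:
        count += 1
        mean += (x - mean) / count    # incremental update of the parameter

print(mean)  # 50.5, identical to computing the mean over all the data at once
```

The result matches the batch computation exactly, but at no point did the update need more than one chunk of data in memory.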

5.4 How Fast to Learn (Learning Rate) – Adapting to Changing Data

  1. With a small learning rate, the system learns more slowly, but it is stable and less sensitive to noisy data and outliers.
  2. With a high learning rate, the system adapts quickly to new data, but it also tends to forget the old data.
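The tradeoff can be seen with a tiny tracking example: an estimate is nudged toward each new observation by `estimate += lr * (observation - estimate)`, and the data suddenly shifts from one level to another. The numbers and rates are arbitrary choices for illustration.

```python
def track(observations, lr):
    """Update a single estimate toward each observation at learning rate lr."""
    estimate = 0.0
    for obs in observations:
        estimate += lr * (obs - estimate)
    return estimate

old_data = [10.0] * 50  # the system sees level 10 for a long time
new_data = [20.0] * 5   # then the data suddenly shifts to level 20

fast = track(old_data + new_data, lr=0.9)   # adapts almost instantly to 20,
                                            # but had "forgotten" 10 just as fast
slow = track(old_data + new_data, lr=0.05)  # stable near 10, slow to reach 20
print(round(fast, 2), round(slow, 2))
```

After only five shifted observations, the high-rate tracker is essentially at the new level while the low-rate tracker has barely moved; the same rates would make the fast tracker jump around on noisy data while the slow one stays steady.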

6. Machine Learning Challenges

Since machine learning relies on both data and algorithms, challenges exist on both sides:

6.1 Data

  1. Poor-Quality Data: if your data has irrelevant features, no machine learning model will do much good. It is like teaching a child the wrong answer instead of the correct one: the child will learn the wrong answer and won't be able to reach the correct one.
  2. Not Enough Data: when data is abundant and of high quality, you may observe that very different machine learning models yield very close results. Having enough data can therefore make a big difference. This is not always the case, and there are exceptions and other details, but I am just giving you the big picture.
  3. Data Not Representative of the Real-World Scenario ("Data Mismatch"): it is very important that your data, specifically your test and validation sets, be representative of the real-world data the model will encounter. If the data is too small, you may fall into the "sampling noise" problem (the data is non-representative as a result of chance). The representativity problem can also arise from a flawed and biased sampling method, such as asking biased questions or any other biased data collection method; even with a large amount of data, the sample can still be non-representative, and this is called "sampling bias".

6.2 Algorithms

  1. Overfitting: the algorithm memorizes the patterns in your data so well that it fails to distinguish the signal from the noise. Your model has learned the noise as well, which makes it perform very well on the training data but generalize poorly to new, unseen data.
  2. Underfitting: the algorithm is so simple that it cannot detect enough of the patterns in the data, so its performance is bad on both the training data and the testing (new, unseen) data. It is like trying to fit data with a non-linear structure using a linear model.
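Both failure modes show up in a toy example on quadratic data. Here the "underfit" model is a straight line through the origin, and the "overfit" model is the extreme case of pure memorization (a lookup table); both are deliberate caricatures to make the symptoms visible.

```python
# Training data follows y = x**2, which no straight line can capture.
train = [(x, x * x) for x in range(-3, 4)]

# Underfitting: the least-squares line y = slope * x through the origin.
# For this symmetric data the best slope is 0, so the line misses all curvature.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)
underfit_error = sum((y - slope * x) ** 2 for x, y in train)

# Overfitting (caricature): memorize every training pair exactly.
lookup = dict(train)
overfit_train_error = sum((y - lookup[x]) ** 2 for x, y in train)

print(underfit_error)        # large: bad even on the training data
print(overfit_train_error)   # 0: "perfect" on training data...
print(10 in lookup)          # False: ...but no prediction at all for unseen x
```

The underfit model is bad everywhere; the memorizer looks perfect on the training set yet tells you nothing about a point it has not seen, which is exactly why training-set performance alone is misleading.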


Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needs more clarification, or if anyone has seen something wrong in what I have written, so I can fix it; I would also appreciate any suggestions or possible enhancements. We are humans, and errors are expected of us, but we can minimize those errors by learning from our mistakes and by seeking to improve what we do.

Allah bless our master Muhammad and his family.


