In the name of Allah, most gracious and most merciful,
1. What is an Algorithm?
It is a set of instructions for performing a specific task. Adding or multiplying two numbers is a simple algorithm, while compressing a file or playing a video requires a more complex one. Even Google uses complex algorithms to retrieve the most relevant documents or web pages for you based on your search query (what you type in the search bar).
2. Normal Algorithm vs Machine Learning Algorithms
So algorithms have existed for a long time; what is new about machine learning algorithms? The point is that machine learning algorithms are adaptable. In other words, you could look at them as clay (the machine learning algorithm) that is shaped by the surface (the data) it is pressed against. This clay isn't infinitely adaptable, though: it has restrictions depending on the material it is made of. Some clays can fit almost any shape, while others can only fit stepped shapes because they can't be fully bent.
Therefore, in supervised machine learning you have knobs (parameters) that are learned from your training data. What if your training data changes? Then you retrain your machine learning algorithm on the new data so it fits that data.
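To make the clay analogy concrete, here is a minimal sketch with an invented one-parameter model (not any particular library's algorithm): the same fitting procedure learns a different knob value for each dataset it is given.

```python
# A one-knob linear model y = w * x, fit by closed-form least squares.
# The knob (parameter w) is learned from whatever training data we give
# the algorithm; new data yields a new w.

def fit_slope(xs, ys):
    """Closed-form least-squares slope for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Old data follows y = 2x; new data follows y = 3x.
w_old = fit_slope([1, 2, 3], [2, 4, 6])
w_new = fit_slope([1, 2, 3], [3, 6, 9])
print(w_old, w_new)  # 2.0 3.0 -- the same algorithm adapts to each dataset
```

Retraining on new data is literally just calling the same fitting procedure again with the new data.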
3. Machine Learning Algorithms Categories
Machine Learning algorithms could be divided into three categories according to their use.
- Classification: When we want to classify something into one of a list of possible categories, like predicting whether an image shows a cat or a dog.
- Regression: When we want to predict a continuous number like the price of a house given its location, number of rooms, and its size.
- Clustering: Detecting groups that share similar traits or features, like grouping customers interested in science together, customers interested in business together, and so on.
Of course, classification and regression are used in supervised learning, while clustering is used in unsupervised learning. If these terms are new to you, you could refer to this post to understand them.
4. Parameters vs Hyperparameters
Now that we have a machine learning model, it is important to distinguish between parameters and hyperparameters. Parameters are internal knobs that get tuned automatically while the model trains on the training set. But during training, we (machine learning engineers) need some indication of whether the model is learning correctly or not. So apart from the parameters (internal knobs), there is another type of knob, external ones (hyperparameters), that the machine learning engineers tune themselves to yield the best model.
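A sketch of the distinction, using an invented toy model y = w * x (the numbers are for illustration only): the weight w is a parameter the training loop adjusts on its own, while learning_rate and n_steps are hyperparameters we choose up front.

```python
# Parameters vs hyperparameters in a toy gradient-descent training loop.

def train(xs, ys, learning_rate=0.1, n_steps=100):
    # learning_rate and n_steps are HYPERPARAMETERS: external knobs
    # chosen by the engineer before training starts.
    w = 0.0  # w is a PARAMETER: the internal knob training adjusts
    for _ in range(n_steps):
        # gradient of mean squared error for the model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w

w = train([1, 2, 3], [2, 4, 6])  # converges toward w = 2
```

Changing learning_rate or n_steps changes how training behaves; changing w directly is not something we do by hand, the data does it for us.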
5. Training vs Testing vs Validation Set
In supervised machine learning, we split our data into three sets.
- Training Set: It is the majority of your data. It could range from about 70% to 90%, depending on your data size, and what is enough for your case. Using the training set, we adjust the machine learning model’s parameters.
- Validation Set: The part of the data used for tuning the machine learning model’s hyperparameters.
- Testing Set: The part of the data used for the final test of the model, as if it were in a real-life scenario. This data should never be used in tuning parameters or hyperparameters. It should be completely new to the model (i.e. data it has never seen) so that we can evaluate whether the model will do well on something it has never seen before.
It is like exams: if the professor gives students questions similar to what they practiced, it is difficult to distinguish students who really understood the material from students who just memorized it. The newer the test questions are, the better an indicator the exam results will be of students' true understanding of the material.
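A minimal sketch of how such a split might be done in plain Python (the 80/10/10 fractions and the seed are illustrative choices, not a standard):

```python
import random

# Shuffle, then cut the data into training / validation / testing sets.
# Shuffling first matters: if the data is ordered (e.g. by date or by
# class), an unshuffled split gives unrepresentative sets.

def split_data(data, train_frac=0.8, val_frac=0.1, seed=42):
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]  # everything left over
    return train, val, test

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Every example lands in exactly one of the three sets, which is what keeps the testing set truly unseen.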
6. Learning and Optimization
6.1 Loss and Cost Function
But what tells the model’s parameters how to change (increase or decrease), and by how much, in order to fit the data well? This is where the loss or cost function comes in. It is a function indicating the error the model makes (i.e. the difference between the true value and the model’s predicted output). This function doesn’t have to be a plain difference; it could be the absolute difference or the squared difference, and there are other loss functions that you could check out.
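For instance, the absolute and squared differences mentioned above, averaged over a toy dataset (the numbers here are invented for illustration):

```python
# Mean absolute error vs mean squared error. Squaring penalizes large
# errors much more heavily than taking the absolute value does.

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]    # one large error of 3
print(mae(y_true, y_pred))  # 1.0
print(mse(y_true, y_pred))  # 3.0  (the single big error dominates)
```

Which of the two you prefer depends on, among other things, how much you want outliers to influence training.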
To give you an example, we could take the derivative of the cost function with respect to the model’s parameters. The derivative tells us in which direction to move each parameter (increase or decrease) in order to minimize the error (i.e. the cost function): we step in the direction opposite to the derivative.
6.2 Optimization
Optimization is how we actually update the model’s parameters. There are many optimizers, like Gradient Descent and the Adam optimizer. Even within gradient descent, there are three different ways we could apply it.
- Batch Gradient Descent: Compute the error (and its gradient) over the entire training set, then update all the model’s parameters once. However, this is computationally heavy and each update takes a long time, since every single step requires a pass over all your training data (which you have to hold in memory during optimization).
- Stochastic Gradient Descent: Update the model’s parameters after computing the error of a single training example. Each update is cheap, but the updates are noisy, because every step is based on just one example rather than the whole dataset. This noise can make convergence to the optimum value slow and erratic.
- Mini-Batch Gradient Descent: Divide your data into small batches, then update the model’s parameters after computing the error of each batch. This avoids the heavy per-update computation and long wait of Batch Gradient Descent, while averaging over a batch smooths out the noisy single-example updates of Stochastic Gradient Descent. So it is an effective middle ground between the two that mitigates the effect of noise in the data.
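The mini-batch procedure above can be sketched as follows, again for the toy model y = w * x (the batch size, learning rate, and epoch count are illustrative hyperparameters, not recommendations):

```python
import random

# Mini-batch gradient descent: shuffle the data each epoch, cut it into
# batches, and update the parameter once per batch.

def minibatch_gd(xs, ys, batch_size=2, learning_rate=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # a fresh order each epoch reduces ordering bias
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # average squared-error gradient over just this batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

w = minibatch_gd([1, 2, 3, 4], [2, 4, 6, 8])  # converges near w = 2
```

With batch_size equal to the dataset size this degenerates into batch gradient descent, and with batch_size = 1 into stochastic gradient descent.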
6.3 Overfitting vs Underfitting
Given that we have trained our model, there are two important terminologies to be aware of: overfitting and underfitting. Simply stated, overfitting is when the model fits the training data so well that it nearly memorizes its patterns, noise included. Without being able to separate true patterns from noisy ones, the model won’t be able to generalize well to unseen data (test data). To combat overfitting we could use what is called regularization.
Underfitting, on the other hand, is when the model fits the training data so badly that it doesn’t even capture its patterns. Since it didn’t learn enough from the training data, it will perform poorly on the test data as well.
The optimum case is in the middle between overfitting and underfitting: the model learns the patterns in the training data while filtering out the noise. Being able to distinguish true signal from noise in the training data is what yields good generalization on unseen data (the testing dataset).
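As one concrete example of regularization (mentioned above as a remedy for overfitting), here is a sketch of the closed-form L2 (ridge) solution for the one-parameter model y = w * x; the penalty strength lam shrinks the learned weight toward zero, discouraging the model from fitting the data too aggressively:

```python
# Ridge (L2-regularized) least squares for y = w * x. With lam = 0 this
# is plain least squares; larger lam shrinks w toward zero.

def ridge_slope(xs, ys, lam=0.0):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1, 2, 3], [2, 4, 6]
print(ridge_slope(xs, ys, lam=0.0))   # 2.0  (plain least squares)
print(ridge_slope(xs, ys, lam=14.0))  # 1.0  (shrunk by the penalty)
```

In practice lam is a hyperparameter, tuned on the validation set like any other.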
7. When to use which?
There are many supervised machine learning algorithms. Which one to use depends on many factors, like:
7.1 Data Type
The type of data, and the relation between its different features. If the data is linearly separable, you could use linear classification or regression algorithms. However, if your data is non-linearly separable, then you could go for non-linear classification and regression algorithms.
A linear relationship means I can predict the output using a linear combination of the input features, i.e. multiplying each feature by a constant weight and adding (or subtracting) the results. Non-linear means the relationship needs a non-linear function like the exponential, log, sin, cos, and so on.
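A small sketch of the difference (the weights and feature values below are made up, e.g. a toy house-price model with the number of rooms and the size in square meters):

```python
import math

# A linear model predicts with a weighted sum of the inputs; a
# non-linear one applies a function like exp, log, or sin on top.

def linear_predict(features, weights, bias=0.0):
    return sum(w * f for w, f in zip(weights, features)) + bias

def nonlinear_predict(features, weights):
    # e.g. a sine applied to a weighted sum: no straight line fits this
    return math.sin(sum(w * f for w, f in zip(weights, features)))

price = linear_predict([3, 120.0], [10.0, 0.5])  # 3 rooms, 120 m^2
print(price)  # 90.0
```

If the true relationship in your data looks like the second function, a purely linear model will underfit no matter how long you train it.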
7.2 Evaluation Metrics
Knowing the data type will narrow my search for a suitable algorithm, but still, there are many algorithms to choose from for a given data type. Here you could use evaluation metrics to choose the best model. There are many evaluation metrics, and which one to choose depends on whether you are doing a classification or a regression problem. Even within each of those, there are more details. I will talk about evaluation for classification models and regression models in separate posts in the future.
Evaluation is done after all of the model’s parameters and hyperparameters have been set. Using that final model, we compare its performance (its answers) against the true answers in the data we have held out (the testing set).
The more experience you have, the more you know when to use which model. Some experts don’t even know the reason behind their choices. Rules ruin experts; they work by intuition. It just makes sense to them without them necessarily being able to explain why. It is as if they have developed a human-learning model in their minds whose parameters are unknown to them. Through experience and deliberate practice they have built this intuition. Therefore practice still makes a lot of difference. Practice makes permanent, especially when it is done deliberately. If you don’t know deliberate practice, you could google it.
Experts in any field are the minority. They are rare. To know more about expertise you could check this wonderful book.
Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needs more clarification, or if anyone has spotted something wrong in what I have written so that I can correct it, and I would also appreciate any possible enhancements or suggestions. We are human, and errors are expected from us, but we can minimize those errors by learning from our mistakes and by seeking to improve what we do.
Allah bless our master Muhammad and his family.