
Data Availability & Data Sources

“Data is the heart of Machine Learning and Data Science”

In the name of Allah, most gracious and most merciful,

1. Introduction

When we talk about machine learning and data science, data is usually indispensable. Instead of deriving a mathematical model theoretically from physical laws, we let the machine learning model identify patterns in the data, hoping they give us promising results according to our evaluation metric. In other words, we fit the data using machine learning algorithms: given some inputs, we pass them through the model and get our outputs.

The raw material of machine learning is data. But is data available?

2. Data Availability

2.1 Ready Labeled Data

The best-case scenario, if you are working with a supervised model, is finding data labeled exactly as you need: the same inputs and outputs you want your machine learning model to learn from. But that rarely happens.

2.2 Weakly Labeled Data

This is easier to find. Here the labels are not exactly what you are looking for, but they correlate with your target label. Therefore, you can apply a transformation to convert the existing labels into the ones you need.
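
As a minimal sketch of such a transformation, assume a hypothetical set of product reviews that carry star ratings while the target label is sentiment. The thresholds and the sample reviews below are illustrative assumptions, not part of any particular dataset.

```python
def rating_to_sentiment(stars):
    """Map a 1-5 star rating (the existing weak label) to a sentiment label."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # 3 stars is ambiguous, so we leave it unlabeled


# Hypothetical weakly labeled data: (review text, star rating).
reviews = [("Great phone", 5), ("Stopped working", 1), ("It is okay", 3)]

# Keep only the reviews whose rating maps to a clear sentiment.
labeled = [
    (text, rating_to_sentiment(stars))
    for text, stars in reviews
    if rating_to_sentiment(stars) is not None
]
```

Dropping the ambiguous middle ratings trades some data for cleaner labels; whether that trade is worth it depends on how much data you have.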

2.3 Unlabeled Data

The data exists, but it is not labeled. Here you have to label it yourself, either by hiring annotators or by using a crowdsourcing service such as Amazon Mechanical Turk (MTurk). In both cases, give the annotators clear instructions on how you want your data to be labeled.

2.3.1 Use existing APIs or libraries

We can use a public Application Programming Interface (API) or library to label the data and then map its classes to the ones that are more relevant to us. It may even have more classes than we need, in which case we filter down to the classes that we need.
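
The mapping and filtering step could look roughly like the sketch below. The class names, the file names, and the shape of the predictions are all hypothetical stand-ins for whatever a real API or library returns.

```python
# Hypothetical mapping from a general-purpose classifier's classes to ours.
CLASS_MAP = {
    "tabby cat": "cat",
    "siamese cat": "cat",
    "golden retriever": "dog",
    "beagle": "dog",
}


def to_our_classes(predictions):
    """Keep only predictions whose class we care about, renamed to our labels."""
    return [
        (item, CLASS_MAP[label])
        for item, label in predictions
        if label in CLASS_MAP
    ]


# Stand-in for what an external API or library might return.
api_predictions = [
    ("img1.jpg", "tabby cat"),
    ("img2.jpg", "sports car"),  # not one of our classes, so it is dropped
    ("img3.jpg", "beagle"),
]
our_labels = to_our_classes(api_predictions)
```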

2.3.2 Utilize weak supervision

Since it is difficult to find data that is labeled exactly as we need, we can use weak supervision by transforming the data into a form more relevant to our case. In this way, we can create a small labeled dataset that serves as a starting point.
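
One common way to apply weak supervision is to write a few heuristic labeling functions and combine their votes. The sketch below assumes a hypothetical spam-detection task; the heuristics, the label names, and the simple majority vote are illustrative choices, not a prescribed method.

```python
from collections import Counter

ABSTAIN = None
SPAM, HAM = "spam", "ham"


def lf_has_link(text):
    """Vote spam when the message contains a URL."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_money_words(text):
    """Vote spam on a few trigger words (crude substring match)."""
    triggers = ("free", "winner", "prize")
    return SPAM if any(w in text.lower() for w in triggers) else ABSTAIN


def lf_short_reply(text):
    """Vote ham for very short messages, which are often personal replies."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_has_link, lf_money_words, lf_short_reply]


def weak_label(text):
    """Combine the labeling functions by majority vote; abstain on ties."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # the heuristics disagree evenly
    return counts[0][0]
```

Messages on which every heuristic abstains stay unlabeled, so the result is a small, noisy labeled set rather than full coverage.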

2.3.3 Active learning

If labeling data is expensive, you could look into active learning, which selects the most informative data points to label so that you maximize learning while keeping your labeling cost low.
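
A simple selection strategy is uncertainty sampling: label the examples the current model is least sure about. The sketch below assumes a binary classifier whose predicted probabilities are already available; the function name and the sample probabilities are hypothetical.

```python
def select_for_labeling(probabilities, budget):
    """Return indices of the `budget` most uncertain examples.

    `probabilities` holds the model's predicted probability of the positive
    class for each unlabeled example; values near 0.5 are the least certain.
    """
    by_uncertainty = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return by_uncertainty[:budget]


# Hypothetical model outputs for five unlabeled examples.
probs = [0.95, 0.52, 0.10, 0.45, 0.70]
to_label = select_for_labeling(probs, budget=2)
```

Here the examples at indices 1 and 3 are selected, since their probabilities are closest to 0.5. After labeling them, you would retrain the model and repeat.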

2.3.4 Learning from implicit and explicit feedback

We can get feedback after deploying our solution. Explicit feedback means asking users directly whether, for example, our classification or our recommendation was helpful. Of course, this should be kept within limits so that we don’t bother users with constant feedback requests and make them feel that they are serving us instead of us serving them. Implicit feedback, by contrast, is inferred from signals that we detect in users’ interactions.
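
As a rough sketch of implicit feedback, assume a recommender that logs whether each recommended item was clicked and how long the user stayed on it. The event fields, the 30-second threshold, and the choice to treat an ignored recommendation as negative are all assumptions made for illustration.

```python
MIN_DWELL_SECONDS = 30  # hypothetical threshold for "the user engaged"


def implicit_label(event):
    """Turn one logged interaction into a training label, or None if unclear."""
    if not event["clicked"]:
        return 0  # assumption: an ignored recommendation counts as negative
    if event["dwell_seconds"] >= MIN_DWELL_SECONDS:
        return 1  # clicked and stayed: treat as helpful
    return None  # clicked but left quickly: ambiguous, skip it


# Hypothetical interaction log.
events = [
    {"item": "A", "clicked": True, "dwell_seconds": 120},
    {"item": "B", "clicked": False, "dwell_seconds": 0},
    {"item": "C", "clicked": True, "dwell_seconds": 3},
]
labels = {e["item"]: implicit_label(e) for e in events}
```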

But where can you get this data?

3. Data Sources

Here is a brief overview of the data sources:

  1. Use a public dataset
  2. Scrape data from websites
  3. Product Intervention: Build instrumentation into your existing product to collect data from users automatically. This takes time, however.
  4. Increase your existing data (“Data Augmentation”):
    a. If you are dealing with text, you have several options to choose from:
      1. Synonym Replacement: Replace “k” non-stopwords in a sentence with their synonyms.
      2. Back Translation: Use a machine translation library to translate your text into a different language, then translate it back into the original language using the same library. Because this round trip can drop important words, you can alternatively use TF-IDF-based word replacement, which swaps only low-information words and keeps the words with high TF-IDF scores.
      3. Bigram Flipping: Divide the sentence into bigrams, then randomly flip one bigram.
      4. Replace Entities: Replace entities such as person names, locations, and organizations with other entities in the same category.
      5. Add Noise: Randomly replace a word in a sentence with another word that is close in spelling, or replace a few characters with their neighbors on a QWERTY keyboard, for example, to simulate the “fat finger” problem.
      6. Build Training Data Automatically without Manual Labeling: Using heuristics, you can create synthetic data by transforming existing data. You can use tools such as Snorkel, Easy Data Augmentation (EDA), and NLPAug.
    b. If you are dealing with images, you can do the following:
      1. Image Rotation
      2. Image Flipping
      3. Image Noising
      4. Image Blurring
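
A few of the augmentations above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the synonym table, stopword list, and keyboard-neighbor map are tiny hypothetical samples, and images are represented as nested lists of pixel values rather than with an imaging library.

```python
import random

# Tiny illustrative tables; a real project would use much larger resources.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}
KEY_NEIGHBORS = {"a": "qwsz", "e": "wrsd", "o": "ipkl", "t": "rygf", "n": "bhjm"}


def synonym_replacement(sentence, k, rng):
    """Replace up to k non-stopwords that have an entry in SYNONYMS."""
    words = sentence.split()
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOPWORDS and w.lower() in SYNONYMS
    ]
    for i in rng.sample(candidates, min(k, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


def bigram_flip(sentence, rng):
    """Split into non-overlapping bigrams and swap the words of one of them."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = rng.choice(range(0, len(words) - 1, 2))
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def keyboard_noise(sentence, n_typos, rng):
    """Replace up to n_typos characters with a neighboring QWERTY key."""
    chars = list(sentence)
    positions = [i for i, c in enumerate(chars) if c in KEY_NEIGHBORS]
    for i in rng.sample(positions, min(n_typos, len(positions))):
        chars[i] = rng.choice(KEY_NEIGHBORS[chars[i]])
    return "".join(chars)


def flip_horizontal(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate an image (list of pixel rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


rng = random.Random(0)  # seeded so runs are repeatable
text = "the quick dog is happy"
augmented = [
    synonym_replacement(text, 2, rng),
    bigram_flip(text, rng),
    keyboard_noise(text, 2, rng),
]
```

Passing the random generator in explicitly keeps each function reproducible, which helps when you want to regenerate the same augmented dataset later.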


Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needs more clarification or has spotted something wrong in what I have written, so that I can correct it, and I would also appreciate any enhancements or suggestions. We are humans, and errors are expected from us, but we can minimize those errors by learning from our mistakes and by seeking to improve what we do.

Allah bless our master Muhammad and his family.

