Natural Language Processing Brief Summary

In the name of Allah, most gracious and most merciful,

This is a brief summary of my understanding of Natural Language Processing. I will not go into much detail on each of the topics I cover; I will just touch on them so that the big picture and some key points are clear. I hope this post will be a good guide to the NLP big picture. One of the missing parts in this big picture is Speech Recognition. Insha’Allah I intend to write another blog post for it, because it involves sound, signal, and frequency analysis, which is somewhat different in nature from the things I am talking about here.

1. Introduction

Natural language is complex and has many ambiguities. It is unstructured, unlike programming languages for instance. However, we understand each other well because we are somehow designed to understand and speak the language, and we have the necessary human knowledge that helps us resolve natural language ambiguities by understanding each other in context. The question is: how could we make computers understand the language?

They can process words and phrases, and then try to identify Keywords, Parts of Speech, Named Entities, Dates, Quantities, etc. Using this information, they can parse sentences and extract relevant parts of statements, questions, or instructions. They can also analyze whole documents by finding frequent and rare words, assessing their sentiment, and even grouping similar documents together. By building on all of this, computers can do many things with unstructured text, although they cannot understand it the way humans do.

The NLP pipeline generally has three stages: Text Preprocessing, Feature Extraction, and Modeling. For more details on the Machine Learning pipeline in general, and to understand what features are, you can read this Machine Learning Introduction post. By the way, NLP is a subfield of Machine Learning. For more information on what Machine Learning is, you can read this Machine Learning “ML” post.

Therefore, to better understand what I will explain next, you should read these posts first:

  1. Machine Learning “ML”
  2. Machine Learning Introduction

2. Text Preprocessing

2.1 Convert Raw Data to Clean Text

Textual information comes from multiple sources like websites, files (XML, Word, PDF, Excel, etc.), Optical Character Recognition (OCR), or a Speech Recognition system (Speech to Text). Depending on the source, the preprocessing might differ, but the goal is the same: remove unnecessary information and convert raw text into clean text that is more useful and relevant to our task, for instance without URLs, HTML tags, and other unnecessary characters.

2.2 Further Preprocessing (Preparing Text for Feature Extraction)

Depending on your task, that clean text can be preprocessed further to be more useful. Here are some of the common preprocessing steps. Note that you do not necessarily have to use all of them, since it depends on your task. The more you practice, the more you will understand when to use which.

  1. Lowercasing: For Named Entity Recognition, it is better not to lowercase your words. Named Entities are noun phrases that refer to a specific object, person, or place.
  2. Punctuation Removal
  3. Removing Extra Spaces
  4. Tokenizing
    • Word Tokenization by splitting text into words or tokens
    • Sentence Tokenization by splitting text into sentences
  5. Remove Stopwords, which are very common words like (is, are, the, etc.). They do not add as much information to the text as other words. They can be removed in sentiment analysis so that we reduce our vocabulary and the complexity of later procedures, but they are important in Part of Speech tagging.
  6. Convert Words into Canonical Form for reducing complexity while preserving the essence of word meaning
    • Stemming: Using search-and-replace rules, you can stem words to their root form so that prefixes and suffixes are removed. For instance: caching, cached, caches —> cach. This is fast, but it can produce incomplete words like (cach) that do not exist in the English language.
    • Lemmatization: Similar to stemming, but it uses a dictionary instead of rules to convert different word variants to their common root. It can handle non-trivial word forms, like reducing is, was, were —> to the root (be), which is difficult to do with stemmer rules. Therefore lemmatization needs a dictionary and thus requires more memory, but it is more accurate since it produces words that exist in the English language. A minimal code sketch of these preprocessing steps is shown after this list.
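
Here is a minimal sketch of some of these steps using NLTK (just one of several libraries that provide them; spaCy is another option). It assumes the NLTK punkt, stopwords, and wordnet resources have already been downloaded, and the sample sentence is made up for illustration.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cached pages were caching quickly! Visit https://example.com"

text = text.lower()                                                # 1. lowercasing
text = re.sub(r"https?://\S+", " ", text)                          # cleaning: strip URLs
text = text.translate(str.maketrans("", "", string.punctuation))   # 2. punctuation removal
text = re.sub(r"\s+", " ", text).strip()                           # 3. extra spaces

tokens = nltk.word_tokenize(text)                                  # 4. word tokenization
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]                     # 5. stopword removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                           # 6a. stemming (e.g. cach)
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])          # 6b. lemmatization (e.g. cache)
```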

For Arabic Text Preprocessing you can read this blog post.

3. Feature Extraction

Having clean, normalized text, how can we convert it into a representation suitable as features for the models we will use? It depends on the model you are using and the task you want to perform. Some features are more suitable for document-level tasks (like Spam Detection and Sentiment Analysis), while others are more useful for word-level tasks (like Text Generation and Machine Translation). There are many ways of representing textual information, and through practice you can learn what you need for each problem.

Generally, we will convert documents, words, and characters to vectors in an n-dimensional space. The vector representation is very useful since we can exploit Linear Algebra, for example by computing the dot product between vectors to capture the similarity between documents, words, or characters. The higher the dot product, the higher the similarity between the vectors (documents, words, or characters). We can also divide the dot product by the product of the vectors’ magnitudes (Euclidean norms) to get the cosine similarity, as sketched below. This idea can be extended to what is called TF-IDF, which is very common and powerful, as I will soon explain.
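
Here is a minimal sketch of this idea using NumPy, with two made-up term-count vectors: the raw dot product and the cosine similarity (the dot product divided by the product of the Euclidean norms).

```python
import numpy as np

doc_a = np.array([2.0, 0.0, 1.0, 3.0])  # toy term-count vector for document A
doc_b = np.array([1.0, 1.0, 0.0, 2.0])  # toy term-count vector for document B

dot = np.dot(doc_a, doc_b)
cosine = dot / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

print(dot)     # raw similarity; grows with the vector magnitudes
print(cosine)  # normalized similarity, between -1 and 1
```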

Here we see the power of Mathematics starting to appear once we convert text to numbers.

3.1 Document-level Features

These features look at an entire document, or collection of words, as one unit. Therefore, inferences are also expected to be at the document level.

3.1.1 Bag of Words (BoW)

It treats each document as an unordered collection (bag) of words. Here are the steps required to form a Bag of Words:

  1. The tokens you have after text preprocessing form the unordered collection, or set, for each document.
  2. Form your vocabulary by collecting all unique words present in your corpus (all of your documents).
  3. Make your vocabulary tokens the columns of a table. In this table each document is a row.
  4. Convert each document into a vector of numbers by counting how many times each word occurs in that document and entering the count in the corresponding column (a minimal code sketch follows this list).
    • Now you have what is called a Document-Term Matrix, which captures the relationship between documents (rows) and words or terms (columns).
    • Each element can be considered a Term Frequency, i.e. the number of times that term (column) occurs in that document (row).
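
Here is a minimal sketch of these steps using scikit-learn’s CountVectorizer, which builds the vocabulary and the Document-Term Matrix in one call; the three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)      # Document-Term Matrix (sparse)

print(vectorizer.get_feature_names_out())   # the vocabulary = the columns
print(dtm.toarray())                        # rows = documents, cells = term frequencies
```
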
3.1.2 Term Frequency-Inverse Document Frequency (TF-IDF)

The Bag of Words treats every word as equally important, but we know that this is not the case in reality; a word’s importance depends on the document’s topic. Instead, TF-IDF assigns weights to words that signify their relevance in documents.

Here are the steps required for extending the (BoW) to (TF-IDF):

  1. Count the number of documents in which each word occurs and insert a new row containing this count for each word’s column. This row is called the Document Frequency.
  2. Divide the Term Frequency in each cell of the table by the Document Frequency of that term. Now we have a number that is proportional to the Term Frequency and inversely proportional to the Document Frequency, thus highlighting words that are more unique to a document and giving us a better representation of the document. That is the core idea behind TF-IDF.
\mathrm{tfidf}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
= \frac{\mathrm{count}(t, d)}{\left| d \right|} \cdot \log\left( \frac{\left| D \right|}{\left| \{\, d \in D : t \in d \,\} \right|} \right)

Where D is the collection of documents (so |D| is the total number of documents), d is a document, and t is a term. There are several variations of the TF-IDF equation shown here that try to smooth the resulting values or prevent edge cases like division-by-zero errors. A minimal code sketch of this computation follows.
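
Here is a minimal sketch that computes exactly the equation above for a couple of toy documents (note that library implementations such as scikit-learn’s TfidfVectorizer use smoothed variants, so their numbers will differ slightly).

```python
import math

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)            # count(t, d) / |d|
    df = sum(1 for d in docs if term in d)     # |{d in D : t in d}|
    return tf * math.log(len(docs) / df)       # TF * log(|D| / DF)

print(tfidf("cat", corpus[0], corpus))  # "cat" is unique to document 0 -> weight > 0
print(tfidf("the", corpus[0], corpus))  # "the" occurs in every document -> weight = 0
```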

3.2 Word-level Features

For deeper text analysis, we want to represent each word by a vector.

3.2.1 One-Hot Encoding

It is the same as Bag of Words except that:

  1. Each row now represents a word rather than a document, while the columns still represent words as before.
  2. The Term Frequency is replaced with 1 at the intersection where the row and column refer to the same word, and 0 everywhere else.

One-Hot Encoding is also used in data analysis more generally, for example to encode the class labels in multi-class classification.
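
Here is a minimal sketch of one-hot encoding over a toy vocabulary.

```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0      # 1 at the word's own position, 0 everywhere else
    return vec

print(one_hot("dog"))           # [0. 1. 0. 0.]
```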

3.2.2 Word Embedding

One-hot encoding breaks down when we have a large vocabulary, because the size of the word representation grows with the size of the vocabulary. This is where Word Embedding comes into play: it controls the size of the word representation by limiting it to a fixed-size vector. A word embedding is a representation of each word in some vector space with great properties: words with similar meanings are closer in that space, and the meaning of each word is distributed throughout the vector. We can even do addition and subtraction that make sense in the embedding space. Similar words are clustered together.

Moreover, words can be close to each other along one dimension, e.g. (Tea and Coffee) are both beverages, while being far from each other along another dimension (i.e. another projection in the n-dimensional space), for example a dimension that captures how those beverages differ. So we have a choice: increase the vector size to increase the number of dimensions and contexts of meaning it can capture, or choose a smaller vector size to decrease the complexity of the word embedding. In other words, there is a tradeoff between increasing performance by increasing the Word Embedding size and decreasing complexity by decreasing it. Since in natural language there are many dimensions along which word meanings can change, the more dimensions you have in your word vector, the more expressive that representation will be.

Word Embeddings let us enter the world of NLP Transfer Learning, because now we can have pretrained Word Embeddings that we can use efficiently by storing them in a lookup table. For each word, we retrieve its corresponding vector representation from that lookup table (a minimal sketch is shown below). Furthermore, we can adjust the word embeddings slightly during our new task so that they become more specific to it. This is very similar to the idea of Transfer Learning in Computer Vision.
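
Here is a minimal sketch of such a lookup table. The embedding matrix is randomly initialized here just for illustration; in practice its rows would be loaded from pretrained vectors (e.g. word2vec or GloVe) and optionally fine-tuned on the downstream task.

```python
import numpy as np

vocab = ["tea", "coffee", "cat", "dog"]
index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 8
embedding_table = np.random.randn(len(vocab), embedding_dim)  # stand-in for pretrained vectors

def embed(word):
    return embedding_table[index[word]]   # O(1) lookup of a fixed-size vector

print(embed("tea").shape)                 # (8,)
```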

3.2.2.1 Word2Vec

The core idea behind Word2Vec is the Distributional Hypothesis, which states that words that occur in the same contexts tend to have similar meanings. Therefore, a model that is able to predict a given word from its neighboring words (Continuous Bag of Words, CBoW), or vice versa, to predict the neighboring words of a given word (Continuous Skip-gram), is likely to capture the contextual meanings of words.

How is it formed? For example, in the Skip-gram model you pick any word, one-hot encode it, then feed it into a neural network (or some other probabilistic model) that is designed to predict a few surrounding words (the input word’s context). Train the network with a suitable loss function to optimize the model’s weights and other parameters. The trained model should now be able to predict the context words well; therefore, it has somehow understood the language and is able to predict words in context. An intermediate representation, such as a hidden layer of the neural network, is the Word2Vec Word Embedding.
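
Here is a minimal sketch using the gensim library (assuming gensim 4.x, where the parameter is called vector_size). Setting sg=1 selects the Skip-gram variant; sg=0 would give CBoW. The toy corpus is far too small to learn meaningful vectors and is only meant to show the API.

```python
from gensim.models import Word2Vec

sentences = [
    ["i", "drink", "tea", "every", "morning"],
    ["i", "drink", "coffee", "every", "morning"],
]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

print(model.wv["tea"].shape)          # the embedding vector for "tea"
print(model.wv.most_similar("tea"))   # nearest neighbors in the embedding space
```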

3.2.2.2 Global Vectors for Word Representation (GloVe)

Using co-occurrence statistics, GloVe tries to come up with a vector representation for each word. Here is how it is formed:

  1. Compute the probability that word j appears in the context of word i (i.e. a conditional probability) for all word pairs (i, j) in the given corpus. Word j appearing in the context of word i means that word j is in the vicinity of word i within a certain context window (i.e. a window of context words).
    1. Count all co-occurrences of i and j in the given corpus
    2. Normalize the counts to get probabilities
  2. Initialize two random fixed-size vectors for each word: one for when the word acts as a context word, and one for when it is the target word.
  3. For any pair of words (i, j), we want the dot product of their word vectors to match their co-occurrence probability computed before. By having this goal, and by choosing an appropriate loss function, we can iteratively optimize these word vectors until we have vectors that capture the similarities and differences between words (see the sketch below).

Co-occurrence probability values are very low, so it is better to work with the log of these values.
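
Here is a minimal sketch of the co-occurrence counting and the fitting loop described above, using plain NumPy on a made-up corpus. It follows the simplified description in this post (real GloVe additionally uses bias terms and a weighting function on the counts), so treat it as an illustration of the idea rather than a faithful implementation.

```python
import numpy as np

corpus = [["i", "drink", "tea"], ["i", "drink", "coffee"], ["i", "drink", "tea"]]
window = 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Step 1: co-occurrence counts within the context window
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[word], idx[sent[j]]] += 1

# Step 2: two random fixed-size vectors per word (target role and context role)
dim, lr = 10, 0.05
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), dim))   # target vectors
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context vectors

# Step 3: push each dot product W[i] . C[j] towards log(co-occurrence count)
for _ in range(200):
    for a in range(len(vocab)):
        for b in range(len(vocab)):
            if cooc[a, b] > 0:
                err = W[a] @ C[b] - np.log(cooc[a, b])   # squared-error residual
                w_old = W[a].copy()
                W[a] -= lr * err * C[b]
                C[b] -= lr * err * w_old
```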

3.3 Character-level Features and Beyond

There are other possible features used in NLP. Let me give you a brief overview.

  • In character-level features, your model works at the character level, which has its pros and cons. Among the pros is the small vocabulary, since the number of English characters is much smaller than the number of words, and there are almost no Out-of-Vocabulary (OOV) characters, unlike word-level representations. However, characters in themselves do not carry as much meaning as words.
  • WordPiece (a type of subword tokenization) is something between word-level and character-level. For example, words with the same stem or root are divided into two parts (one part containing the stem or root, and the other containing the suffixes, for instance). Therefore, you reduce the vocabulary size while still benefiting from word-level representation. However, Out-of-Vocabulary (OOV) words can still exist. A minimal tokenization sketch is shown after this list.
  • There is also something called SentencePiece, and other tokenizers exist. You can find more about them here.
  • Moreover, there are Contextual Word Embeddings that capture a word’s meaning when it appears in different contexts. Some of the ways of producing these Contextual Word Embeddings are ELMo and BERT (Bidirectional Encoder Representations from Transformers). More information can be found here.
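
Here is a minimal sketch of WordPiece tokenization using the Hugging Face transformers library; the bert-base-uncased tokenizer is a WordPiece tokenizer, and this assumes its files can be downloaded.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pieces prefixed with "##" continue the previous piece; rare words get split
# into several subword pieces, while common words usually stay whole.
print(tokenizer.tokenize("tokenization reduces vocabularies"))
```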

4. Modeling

This section needs many details to be understood well. However, I will go briefly over some concepts to give a sense of the big picture, without going into the level of detail of the previous sections. I will also include links to very good blog posts that illustrate some of these concepts very well, including good visualizations.

4.1 Machine Learning Models

Classical Machine Learning models such as Logistic Regression, Hidden Markov Models, Naive Bayes, and other classification and clustering algorithms can be used. However, as the amount of data increases, we generally tend to move toward deep learning models, while the performance of the other Machine Learning models flattens out. I am speaking generally; there are exceptions and other considerations that are beyond the scope of this introductory blog post.

4.2 Deep Learning Models

4.2.1 RNN

The sequence of words is important in many NLP applications like Machine Translation, Question Answering, Text Summarization, and Chatbots. Therefore, there are special types of Neural Networks specifically dedicated to capturing dependencies across sequences, like the vanilla RNN (Recurrent Neural Network). Other variations of the RNN, namely the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit), overcome the RNN’s vanishing gradient problem and capture longer-term dependencies.

4.2.2 Sequence to Sequence Models (Seq2Seq)

Sequence to sequence models are important where the number of inputs and outputs is not necessarily the same, as in Machine Translation, Question Answering, Text Summarization, and Chatbots. A Seq2Seq model basically consists of (a minimal sketch follows this list):

  1. Encoder: It captures the meaning of the input sequence and produces a fixed-size Context Vector, for example from the RNN’s hidden states if we use RNNs in the Encoder.
  2. Decoder: It converts the context vector into the sequence of outputs.
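
Here is a minimal sketch (in PyTorch) of this Encoder/Decoder split using GRUs. All sizes and the random token IDs are made up; a real model would add real vocabularies, teacher forcing, masking, and a training loop.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        _, hidden = self.rnn(self.embed(src))
        return hidden                        # the fixed-size Context Vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)              # one distribution over the vocabulary per step

src = torch.randint(0, vocab_size, (1, 7))   # toy source sequence of length 7
tgt = torch.randint(0, vocab_size, (1, 5))   # toy target sequence of length 5
logits = Decoder()(tgt, Encoder()(src))
print(logits.shape)                          # torch.Size([1, 5, 100])
```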

The fixed-size Context Vector in Seq2Seq without attention has a major problem: it becomes an information bottleneck for long sequences. However, if we increase the fixed size to capture long sequences, the model can overfit on short sequences. This is where attention comes in to solve that problem.

4.2.3 Sequence to Sequence with Attention Models (Seq2Seq with Attention)

The major difference is that now the Context Vector is not fixed in size. Its size changes with the input sequence length so that it can capture all the information in the input sequence, and the Decoder can then focus on the relevant parts of the Context Vector during decoding by using a scoring method (sketched below). Therefore, no information is discarded.
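
Here is a minimal sketch of that scoring step with plain dot-product scores (other scoring functions exist): the current decoder state is scored against every encoder state, the scores are turned into weights with a softmax, and the context is the weighted sum of encoder states.

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(7, 64)     # one hidden state per input token (toy sizes)
decoder_state = torch.randn(64)         # the current decoder hidden state

scores = encoder_states @ decoder_state        # one score per input position
weights = F.softmax(scores, dim=0)             # attention weights, summing to 1
context = weights @ encoder_states             # weighted focus on the relevant parts

print(weights.shape, context.shape)            # torch.Size([7]) torch.Size([64])
```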

For more information, you can read this blog post.

4.2.4 Transformers (Going beyond LSTMs)

Why not use attention everywhere, without using any RNN or any of its variations (LSTMs or GRUs)? That is what was proposed in the great paper Attention Is All You Need. This idea has made a huge impact on the world of NLP.
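
Here is a minimal sketch of the encoder side using PyTorch’s built-in Transformer layers (assuming a recent PyTorch version that supports batch_first). Positional encodings, padding masks, and the decoder side are omitted.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(1, 10, 64)   # (batch, sequence length, embedding size), toy values
print(encoder(tokens).shape)      # torch.Size([1, 10, 64]): one contextual vector per token
```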

For more information, you can read this blog post.

4.2.5 BERT (Bidirectional Encoder Representations from Transformers) – (The Transformer Encoder)

I have taken the following text as-is from this blog post because it is really well written and summarizes exactly what I want to say:

“ELMo’s language model was bi-directional, but the openAI transformer only trains a forward language model. Could we build a transformer-based model whose language model looks both forward and backwards (in the technical jargon – “is conditioned on both left and right context”)?”

BERT has also enabled great advancements not only in NLP but also in Information Retrieval (yes, it is used in Google Search). Again, I have taken the following text as-is from this blog post:

In October 2019, Google announced its biggest update in recent times: BERT’s adoption in the search algorithm. Google had already adopted models to understand human language, but this update was announced as one of the most significant leaps in search engine history.

BERT became so popular in NLP that several flavors of it have been trained and released for public use, for instance Multilingual BERT (mBERT), and BERT models for specific languages like AraBERT and CAMeLBERT for Arabic, and so on. A minimal usage sketch is shown below.
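
Here is a minimal sketch of obtaining contextual embeddings from a pretrained BERT model with the Hugging Face transformers library (it assumes the model weights can be downloaded; an AraBERT or CAMeLBERT checkpoint could be substituted by changing the model name).

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("NLP is fun", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # one contextual vector per input token
```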

For more information, you can read this blog post.

Finally

Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needs more clarification or has spotted something wrong in what I have written so that I can fix it, and I would also appreciate any possible enhancements or suggestions. We are humans, and errors are expected from us, but we can also minimize those errors by learning from our mistakes and by seeking to improve what we do.

Allah bless our master Muhammad and his family.

References

https://www.udacity.com/course/natural-language-processing-nanodegree–nd892

https://huggingface.co/transformers/tokenizer_summary.html

https://jalammar.github.io/illustrated-bert/

https://rockcontent.com/blog/google-bert/
