Natural Language Processing Brief Summary

In the name of Allah, most gracious and most merciful,

This is a brief summary of my understanding of Natural Language Processing. I will not go into much detail on each of the topics I cover; I will just touch on them so that the big picture and some key points are clear. I hope this post will be a good guide to the NLP big picture. One of the missing parts in this big picture is Speech Recognition. Insha’Allah I intend to write another blog post for it, because it involves sound, signal, and frequency analysis, which is somewhat different in nature from the things I am talking about here.

1. Introduction

Natural language is complex and has many ambiguities. It is unstructured, unlike programming languages for instance. However, we understand each other well because we are somehow designed to understand and speak the language, and we have the necessary human knowledge that helps us resolve natural language ambiguities by understanding each other in context. The question is: how could we make computers understand the language?

They can process words and phrases, and then try to identify Keywords, Parts of Speech, Named Entities, Dates, Quantities, etc. Using this information, they can parse sentences and extract relevant parts of statements, questions, or instructions. They can also analyze whole documents by finding frequent and rare words, assessing their sentiment, and even grouping similar documents together. By building on all of this, computers can do many things with unstructured text, although they cannot understand it the way humans do.

The NLP pipeline generally has three stages: Text Preprocessing, Feature Extraction, and Modeling. For more details on the Machine Learning pipeline in general, and to understand what features are, you can read this Machine Learning Introduction post. By the way, NLP is a subfield of Machine Learning. For more information on what Machine Learning is, you can read this Machine Learning “ML” post.

Therefore, to better understand what I will explain next, you should read these posts first:

  1. Machine Learning “ML”
  2. Machine Learning Introduction

2. Text Preprocessing

2.1 Convert Raw Data to Clean Text

Textual information comes from multiple sources like websites, files (XML, Word, PDF, Excel, etc.), Optical Character Recognition (OCR), or a Speech Recognition system (Speech to Text). Depending on the source, the preprocessing might differ, but the goal is the same: remove unnecessary information and convert raw text into clean text that is more useful and relevant to our task, for instance without URLs, HTML tags, and other unnecessary characters.

2.2 Further Preprocessing (Preparing Text for Feature Extraction)

Depending on your task, that clean text can be preprocessed further to be more useful. Here are some of the common preprocessing steps. Note that you do not necessarily have to use all of them, since it depends on your task. The more you practice, the more you will understand when to use which.

  1. Lowercasing: For Named Entity Recognition, it is better not to lowercase your words. Named Entities are noun phrases that refer to a specific object, person, or place.
  2. Punctuation Removal
  3. Removing Extra Spaces
  4. Tokenizing
    • Word Tokenization by splitting text into words or tokens
    • Sentence Tokenization by splitting text into sentences
  5. Remove Stopwords, which are very common words like (is, are, the, etc.). They do not add as much information to the text as other words. They can be removed in sentiment analysis so that we reduce our vocabulary and the complexity of later procedures, but they are important in Part of Speech tagging.
  6. Convert Words into Canonical Form for reducing complexity while preserving the essence of word meaning
    • Stemming: Using search-and-replace rules, you can stem words to their root form so that prefixes and suffixes are removed. For instance: caching, cached, caches —> cach. This is fast, but it can produce incomplete words like (cach) that do not exist in the English language.
    • Lemmatization: Similar to stemming, but it uses a dictionary instead of rules to convert different word variants to their common root. It can handle non-trivial word forms, like reducing is, was, were —> to the root (be), which is difficult to do with stemmer rules. Therefore lemmatization needs a dictionary and thus requires more memory, but it is more accurate since it produces words that exist in the English language. A minimal code sketch of these preprocessing steps is shown after this list.
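
Here is a minimal sketch of some of these steps using NLTK (just one of several libraries that provide them; spaCy is another option). It assumes the NLTK punkt, stopwords, and wordnet resources have already been downloaded, and the sample sentence is made up for illustration.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cached pages were caching quickly! Visit https://example.com"

text = text.lower()                                                # 1. lowercasing
text = re.sub(r"https?://\S+", " ", text)                          # cleaning: strip URLs
text = text.translate(str.maketrans("", "", string.punctuation))   # 2. punctuation removal
text = re.sub(r"\s+", " ", text).strip()                           # 3. extra spaces

tokens = nltk.word_tokenize(text)                                  # 4. word tokenization
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]                     # 5. stopword removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                           # 6a. stemming (e.g. cach)
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])          # 6b. lemmatization (e.g. cache)
```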

For Arabic Text Preprocessing you can read this blog post.

3. Feature Extraction

Having clean, normalized text, how can we convert it into a representation suitable as features for the models we will use? It depends on the model you are using and the task you want to perform. Some features are more suitable for document-level tasks (like Spam Detection and Sentiment Analysis), while others are more useful for word-level tasks (like Text Generation and Machine Translation). There are many ways of representing textual information, and through practice you can learn what you need for each problem.

Generally, we will convert documents, words, and characters to vectors in an n-dimensional space. The vector representation is very useful since we can exploit Linear Algebra, for example by computing the dot product between vectors to capture the similarity between documents, words, or characters. The higher the dot product, the higher the similarity between the vectors (documents, words, or characters). We can also divide the dot product by the product of the vectors’ magnitudes (Euclidean norms) to get the cosine similarity, as sketched below. This idea can be extended to what is called TF-IDF, which is very common and powerful, as I will soon explain.
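
Here is a minimal sketch of this idea using NumPy, with two made-up term-count vectors: the raw dot product and the cosine similarity (the dot product divided by the product of the Euclidean norms).

```python
import numpy as np

doc_a = np.array([2.0, 0.0, 1.0, 3.0])  # toy term-count vector for document A
doc_b = np.array([1.0, 1.0, 0.0, 2.0])  # toy term-count vector for document B

dot = np.dot(doc_a, doc_b)
cosine = dot / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

print(dot)     # raw similarity; grows with the vector magnitudes
print(cosine)  # normalized similarity, between -1 and 1
```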

Here we see the power of Mathematics starting to appear once we convert text to numbers.

3.1 Document-level Features

These features look at an entire document, or collection of words, as one unit. Therefore, inferences are also expected to be at the document level.

3.1.1 Bag of Words (BoW)

It treats each document as an unordered collection (bag) of words. Here are the steps required to form a Bag of Words:

  1. The tokens you have after text preprocessing form the unordered collection, or set, for each document.
  2. Form your vocabulary by collecting all unique words present in your corpus (all of your documents).
  3. Make your vocabulary tokens the columns of a table. In this table each document is a row.
  4. Convert each document into a vector of numbers by counting how many times each word occurs in that document and entering the count in the corresponding column (a minimal code sketch follows this list).
    • Now you have what is called a Document-Term Matrix, which captures the relationship between documents (rows) and words or terms (columns).
    • Each element can be considered a Term Frequency, i.e. the number of times that term (column) occurs in that document (row).
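
Here is a minimal sketch of these steps using scikit-learn’s CountVectorizer, which builds the vocabulary and the Document-Term Matrix in one call; the three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)      # Document-Term Matrix (sparse)

print(vectorizer.get_feature_names_out())   # the vocabulary = the columns
print(dtm.toarray())                        # rows = documents, cells = term frequencies
```
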
3.1.2 Term Frequency-Inverse Document Frequency (TF-IDF)

The Bag of Words treats every word as equally important, but we know that this is not the case in reality; a word’s importance depends on the document’s topic. Instead, TF-IDF assigns weights to words that signify their relevance in documents.

Here are the steps required for extending the (BoW) to (TF-IDF):

  1. Count the number of documents in which each word occurs and insert a new row containing this count for each word’s column. This row is called the Document Frequency.
  2. Divide the Term Frequency in each cell of the table by the Document Frequency of that term. Now we have a number that is proportional to the Term Frequency and inversely proportional to the Document Frequency, thus highlighting words that are more unique to a document and giving us a better representation of the document. That is the core idea behind TF-IDF.
\mathrm{tfidf}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
= \frac{\mathrm{count}(t, d)}{\left| d \right|} \cdot \log\left( \frac{\left| D \right|}{\left| \{\, d \in D : t \in d \,\} \right|} \right)

Where D is the collection of documents (so |D| is the total number of documents), d is a document, and t is a term. There are several variations of the TF-IDF equation shown here that try to smooth the resulting values or prevent edge cases like division-by-zero errors. A minimal code sketch of this computation follows.
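
Here is a minimal sketch that computes exactly the equation above for a couple of toy documents (note that library implementations such as scikit-learn’s TfidfVectorizer use smoothed variants, so their numbers will differ slightly).

```python
import math

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)            # count(t, d) / |d|
    df = sum(1 for d in docs if term in d)     # |{d in D : t in d}|
    return tf * math.log(len(docs) / df)       # TF * log(|D| / DF)

print(tfidf("cat", corpus[0], corpus))  # "cat" is unique to document 0 -> weight > 0
print(tfidf("the", corpus[0], corpus))  # "the" occurs in every document -> weight = 0
```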

3.2 Word-level Features

For deeper text analysis, we want to represent each word by a vector.

3.2.1 One-Hot Encoding

It is the same as Bag of Words except that:

  1. Each row now represents a word rather than a document, while the columns still represent words as before.
  2. The Term Frequency is replaced with 1 at the intersection where the row and column refer to the same word, and 0 everywhere else.

One-Hot Encoding is also used in data analysis more generally, for example to encode the class labels in multi-class classification.
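
Here is a minimal sketch of one-hot encoding over a toy vocabulary.

```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0      # 1 at the word's own position, 0 everywhere else
    return vec

print(one_hot("dog"))           # [0. 1. 0. 0.]
```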

3.2.2 Word Embedding

One-hot encoding breaks down when we have a large vocabulary, because the size of the word representation grows with the size of the vocabulary. This is where Word Embedding comes into play: it controls the size of the word representation by limiting it to a fixed-size vector. A word embedding is a representation of each word in some vector space with great properties: words with similar meanings are closer in that space, and the meaning of each word is distributed throughout the vector. We can even do addition and subtraction that make sense in the embedding space. Similar words are clustered together.

Moreover, words can be close to each other along one dimension, e.g. (Tea and Coffee) are both beverages, while being far from each other along another dimension (i.e. another projection in the n-dimensional space), for example a dimension that captures how those beverages differ. So we have a choice: increase the vector size to increase the number of dimensions and contexts of meaning it can capture, or choose a smaller vector size to decrease the complexity of the word embedding. In other words, there is a tradeoff between increasing performance by increasing the Word Embedding size and decreasing complexity by decreasing it. Since in natural language there are many dimensions along which word meanings can change, the more dimensions you have in your word vector, the more expressive that representation will be.

Word Embeddings let us enter the world of NLP Transfer Learning, because now we can have pretrained Word Embeddings that we can use efficiently by storing them in a lookup table. For each word, we retrieve its corresponding vector representation from that lookup table (a minimal sketch is shown below). Furthermore, we can adjust the word embeddings slightly during our new task so that they become more specific to it. This is very similar to the idea of Transfer Learning in Computer Vision.
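
Here is a minimal sketch of such a lookup table. The embedding matrix is randomly initialized here just for illustration; in practice its rows would be loaded from pretrained vectors (e.g. word2vec or GloVe) and optionally fine-tuned on the downstream task.

```python
import numpy as np

vocab = ["tea", "coffee", "cat", "dog"]
index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 8
embedding_table = np.random.randn(len(vocab), embedding_dim)  # stand-in for pretrained vectors

def embed(word):
    return embedding_table[index[word]]   # O(1) lookup of a fixed-size vector

print(embed("tea").shape)                 # (8,)
```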

3.2.2.1 Word2Vec

The core idea behind Word2Vec is the Distributional Hypothesis, which states that words that occur in the same contexts tend to have similar meanings. Therefore, a model that is able to predict a given word from its neighboring words (Continuous Bag of Words, CBoW), or vice versa, to predict the neighboring words of a given word (Continuous Skip-gram), is likely to capture the contextual meanings of words.

How is it formed? For example, in the Skip-gram model you pick any word, one-hot encode it, then feed it into a neural network (or some other probabilistic model) that is designed to predict a few surrounding words (the input word’s context). Train the network with a suitable loss function to optimize the model’s weights and other parameters. The trained model should now be able to predict the context words well; therefore, it has somehow understood the language and is able to predict words in context. An intermediate representation, such as a hidden layer of the neural network, is the Word2Vec Word Embedding.
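
Here is a minimal sketch using the gensim library (assuming gensim 4.x, where the parameter is called vector_size). Setting sg=1 selects the Skip-gram variant; sg=0 would give CBoW. The toy corpus is far too small to learn meaningful vectors and is only meant to show the API.

```python
from gensim.models import Word2Vec

sentences = [
    ["i", "drink", "tea", "every", "morning"],
    ["i", "drink", "coffee", "every", "morning"],
]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

print(model.wv["tea"].shape)          # the embedding vector for "tea"
print(model.wv.most_similar("tea"))   # nearest neighbors in the embedding space
```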

3.2.2.2 Global Vectors for Word Representation (GloVe)

Using co-occurrence statistics, GloVe tries to come up with a vector representation for each word. Here is how it is formed:

  1. Compute the probability that word j appears in the context of word i (i.e. a conditional probability) for all word pairs (i, j) in the given corpus. Word j appearing in the context of word i means that word j is in the vicinity of word i within a certain context window (i.e. a window of context words).
    1. Count all co-occurrences of i and j in the given corpus
    2. Normalize the counts to get probabilities
  2. Initialize two random fixed-size vectors for each word: one for when the word acts as a context word, and one for when it is the target word.
  3. For any pair of words (i, j), we want the dot product of their word vectors to match their co-occurrence probability computed before. By having this goal, and by choosing an appropriate loss function, we can iteratively optimize these word vectors until we have vectors that capture the similarities and differences between words (see the sketch below).

Co-occurrence probability values are very low, so it is better to work with the log of these values.
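
Here is a minimal sketch of the co-occurrence counting and the fitting loop described above, using plain NumPy on a made-up corpus. It follows the simplified description in this post (real GloVe additionally uses bias terms and a weighting function on the counts), so treat it as an illustration of the idea rather than a faithful implementation.

```python
import numpy as np

corpus = [["i", "drink", "tea"], ["i", "drink", "coffee"], ["i", "drink", "tea"]]
window = 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Step 1: co-occurrence counts within the context window
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[word], idx[sent[j]]] += 1

# Step 2: two random fixed-size vectors per word (target role and context role)
dim, lr = 10, 0.05
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), dim))   # target vectors
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context vectors

# Step 3: push each dot product W[i] . C[j] towards log(co-occurrence count)
for _ in range(200):
    for a in range(len(vocab)):
        for b in range(len(vocab)):
            if cooc[a, b] > 0:
                err = W[a] @ C[b] - np.log(cooc[a, b])   # squared-error residual
                w_old = W[a].copy()
                W[a] -= lr * err * C[b]
                C[b] -= lr * err * w_old
```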

3.3 Character-level Features and Beyond

There are other possible features used in NLP. Let me give you a brief overview.

  • In character-level features, your model works at the character level, which has its pros and cons. Among the pros is the small vocabulary, since the number of English characters is much smaller than the number of words, and there are almost no Out-of-Vocabulary (OOV) characters, unlike word-level representations. However, characters in themselves do not carry as much meaning as words.
  • WordPiece (a type of subword tokenization) is something between word-level and character-level. For example, words with the same stem or root are divided into two parts (one part containing the stem or root, and the other containing the suffixes, for instance). Therefore, you reduce the vocabulary size while still benefiting from word-level representation. However, Out-of-Vocabulary (OOV) words can still exist. A minimal tokenization sketch is shown after this list.
  • There is also something called SentencePiece, and other tokenizers exist. You can find more about them here.
  • Moreover, there are Contextual Word Embeddings that capture a word’s meaning when it appears in different contexts. Some of the ways of producing these Contextual Word Embeddings are ELMo and BERT (Bidirectional Encoder Representations from Transformers). More information can be found here.
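
Here is a minimal sketch of WordPiece tokenization using the Hugging Face transformers library; the bert-base-uncased tokenizer is a WordPiece tokenizer, and this assumes its files can be downloaded.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pieces prefixed with "##" continue the previous piece; rare words get split
# into several subword pieces, while common words usually stay whole.
print(tokenizer.tokenize("tokenization reduces vocabularies"))
```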

4. Modeling

This section needs many details to be understood well. However, I will go briefly over some concepts to give a sense of the big picture, without going into the level of detail of the previous sections. I will also include links to very good blog posts that illustrate some of these concepts very well, including good visualizations.

4.1 Machine Learning Models

Classical Machine Learning models such as Logistic Regression, Hidden Markov Models, Naive Bayes, and other classification and clustering algorithms can be used. However, as the amount of data increases, we generally tend to move toward deep learning models, while the performance of the other Machine Learning models flattens out. I am speaking generally; there are exceptions and other considerations that are beyond the scope of this introductory blog post.

4.2 Deep Learning Models

4.2.1 RNN

The sequence of words is important in many NLP applications like Machine Translation, Question Answering, Text Summarization, and Chatbots. Therefore, there are special types of Neural Networks specifically dedicated to capturing dependencies across sequences, like the vanilla RNN (Recurrent Neural Network). Other variations of the RNN, namely the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit), overcome the RNN’s vanishing gradient problem and capture longer-term dependencies.

4.2.2 Sequence to Sequence Models (Seq2Seq)

Sequence to sequence models are important where the number of inputs and outputs is not necessarily the same, as in Machine Translation, Question Answering, Text Summarization, and Chatbots. A Seq2Seq model basically consists of (a minimal sketch follows this list):

  1. Encoder: It captures the meaning of the input sequence and produces a fixed-size Context Vector, for example from the RNN’s hidden states if we use RNNs in the Encoder.
  2. Decoder: It converts the context vector into the sequence of outputs.
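
Here is a minimal sketch (in PyTorch) of this Encoder/Decoder split using GRUs. All sizes and the random token IDs are made up; a real model would add real vocabularies, teacher forcing, masking, and a training loop.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        _, hidden = self.rnn(self.embed(src))
        return hidden                        # the fixed-size Context Vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)              # one distribution over the vocabulary per step

src = torch.randint(0, vocab_size, (1, 7))   # toy source sequence of length 7
tgt = torch.randint(0, vocab_size, (1, 5))   # toy target sequence of length 5
logits = Decoder()(tgt, Encoder()(src))
print(logits.shape)                          # torch.Size([1, 5, 100])
```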

The fixed-size Context Vector in Seq2Seq without attention has a major problem: it becomes an information bottleneck for long sequences. However, if we increase the fixed size to capture long sequences, the model can overfit on short sequences. This is where attention comes in to solve that problem.

4.2.3 Sequence to Sequence with Attention Models (Seq2Seq with Attention)

The major difference is that now the Context Vector is not fixed in size. Its size changes with the input sequence length so that it can capture all the information in the input sequence, and the Decoder can then focus on the relevant parts of the Context Vector during decoding by using a scoring method (sketched below). Therefore, no information is discarded.
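
Here is a minimal sketch of that scoring step with plain dot-product scores (other scoring functions exist): the current decoder state is scored against every encoder state, the scores are turned into weights with a softmax, and the context is the weighted sum of encoder states.

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(7, 64)     # one hidden state per input token (toy sizes)
decoder_state = torch.randn(64)         # the current decoder hidden state

scores = encoder_states @ decoder_state        # one score per input position
weights = F.softmax(scores, dim=0)             # attention weights, summing to 1
context = weights @ encoder_states             # weighted focus on the relevant parts

print(weights.shape, context.shape)            # torch.Size([7]) torch.Size([64])
```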

For more information, you can read this blog post.

4.2.4 Transformers (Going beyond LSTMs)

Why not use attention everywhere, without using any RNN or any of its variations (LSTMs or GRUs)? That is what was proposed in the great paper Attention Is All You Need. This idea has made a huge impact on the world of NLP.
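
Here is a minimal sketch of the encoder side using PyTorch’s built-in Transformer layers (assuming a recent PyTorch version that supports batch_first). Positional encodings, padding masks, and the decoder side are omitted.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(1, 10, 64)   # (batch, sequence length, embedding size), toy values
print(encoder(tokens).shape)      # torch.Size([1, 10, 64]): one contextual vector per token
```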

For more information, you can read this blog post.

4.2.5 BERT (Bidirectional Encoder Representations from Transformers) – (The Transformer Encoder)

I have taken the following text as-is from this blog post because it is really well written and summarizes exactly what I want to say:

“ELMo’s language model was bi-directional, but the openAI transformer only trains a forward language model. Could we build a transformer-based model whose language model looks both forward and backwards (in the technical jargon – “is conditioned on both left and right context”)?”

BERT has also enabled great advancements not only in NLP but also in Information Retrieval (yes, it is used in Google Search). Again, I have taken the following text as-is from this blog post:

In October 2019, Google announced its biggest update in recent times: BERT’s adoption in the search algorithm. Google had already adopted models to understand human language, but this update was announced as one of the most significant leaps in search engine history.

BERT became so popular in NLP that several flavors of it have been trained and released for public use, for instance Multilingual BERT (mBERT), and BERT models for specific languages like AraBERT and CAMeLBERT for Arabic, and so on. A minimal usage sketch is shown below.
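
Here is a minimal sketch of obtaining contextual embeddings from a pretrained BERT model with the Hugging Face transformers library (it assumes the model weights can be downloaded; an AraBERT or CAMeLBERT checkpoint could be substituted by changing the model name).

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("NLP is fun", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # one contextual vector per input token
```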

For more information, you can read this blog post.

Finally

Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needs more clarification or has spotted something wrong in what I have written so that I can fix it, and I would also appreciate any possible enhancements or suggestions. We are humans, and errors are expected from us, but we can also minimize those errors by learning from our mistakes and by seeking to improve what we do.

Allah bless our master Muhammad and his family.

References

https://www.udacity.com/course/natural-language-processing-nanodegree–nd892

https://huggingface.co/transformers/tokenizer_summary.html

https://jalammar.github.io/illustrated-bert/

https://rockcontent.com/blog/google-bert/
