Linguistics and Arabic Natural Language Processing (NLP) Introduction

In the name of Allah, most gracious and most merciful,

A lot of the text here is taken as is from Dr. Nizar Habash’s papers and his textbook Introduction to Arabic Natural Language Processing. I tried to organize what I understood in a way that is easy for me. In addition, many of the term definitions are taken as they are from Wikipedia.

1. Introduction

Some time after shifting my career to the field of Natural Language Processing (NLP), I started to make more sense of the NLP big picture. However, when I wanted to learn more about Arabic NLP, I needed to understand more about linguistics, since Arabic is a morphologically rich language and NLP is an interdisciplinary field between Computer Science and Linguistics. Moreover, I wanted to read more about linguistics when I read some of the Arabic NLP papers by one of the great leaders in the field, Dr. Nizar Habash. I know that he has two bachelor’s degrees, one in Computer Engineering and one in Linguistics and Languages. Therefore, if I want to specialize more in Arabic NLP and understand his papers better, I have to improve my linguistics knowledge to keep up with his research and to learn more from him.

Therefore, I decided to share my understanding of Linguistics with those who came from a non-linguistic background like me. Moreover, I summarized some of my understanding of Dr. Nizar Habash’s Arabic NLP papers, including how to approach Arabic NLP. Finally, I have shared notes that I have taken while reading some of Dr. Nizar Habash’s papers. I hope this post will be useful to you.

This is not the full picture of Arabic NLP since I read only some of Dr. Nizar Habash’s papers. However, I hope that this will be a good brief introduction to Linguistics and Arabic NLP.

2. Linguistics Subfields ​1​

Machine Learning approaches to NLP require feature extraction. Therefore, knowing linguistic structures can help NLP engineers design better features for machine learning approaches to NLP. ​1​ Moreover, feature engineering is known to be one of the most important steps in the Machine Learning pipeline, as I have explained in this blog post. Garbage in, garbage out. In addition, knowledge of linguistic structure helps in the error analysis of NLP systems. ​1​

2.1 Phonetics

The study of the sounds of human language. It studies how humans produce and perceive sounds, or in the case of sign languages, the equivalent aspects of sign.

2.2 Phonology

The study of sound systems in human languages. It studies how languages or dialects systematically organize their sounds (or signs, in sign languages). The term also refers to the sound system of any particular language variety. In other words, it is the study of how individual sounds or handshapes are combined into specific patterns.

2.3 Morphology

The study of the formation and internal structure of words. It is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and their parts like stems, roots, prefixes, and suffixes. Moreover, it looks at parts of speech, intonation and stress, and how context can change a word’s pronunciation and meaning.

Dictionary-makers define one entry or unit as the largest unpredictable combination of form and meaning. They call these units lexemes or lexical items, because they are the parts of a lexicon, which is another word for a dictionary.

Example: Rabbits. Rabbit and -s are examples of the smallest unpredictable combinations of form and meaning. Linguists call these units morphemes, and the study of them is morphology.

Dividing language into morphemes is helpful since it helps us see patterns across languages. A separate word in a language may be a part of a word in another language.

2.4 Syntax

The study of the formation and internal structure of sentences. The study of how languages express relationships between words. It is the set of rules, principles, and processes that govern the structure of sentences in a given language, usually including word order. In other words, it is the study of how words group together to make sentences. One basic description of a language’s syntax is the sequence in which the subject (S), verb (V), and object (O) usually appear in sentences. We can see the similarities and differences by looking at language from the syntax perspective.


Drawing the connections between words is a simple way to keep track of the different parts of a sentence. Linguists often represent the structural relationships between words using a tree diagram, which is somewhat similar to a family tree with nodes and branches.

2.5 Morphosyntax

Morphosyntax is another word for grammar. Linguists use the word grammar to talk about a language’s structural patterns: how a language puts morphemes together into words, words together into constituents, and constituents into sentences. A grammar is a description of how sentences go together in a language.

2.6 Semantics

The study of the meaning of words and sentences, and the many ways that we can describe them. There are different theories in linguistic semantics like Binary Feature Analysis, Natural Semantic Metalanguage, Formal Semantics, Conceptual Semantics, Cognitive Semantics, Lexical Semantics, Cross-cultural Semantics, and Computational Semantics.

2.6.1 Binary Feature Analysis

Precisely describes words that are part of a taxonomy, like words for family members.

2.6.2 Natural Semantic Metalanguage

A linguistic theory where words can be broken down into other, more basic units of meaning.

2.6.3 Cognitive Semantics

A linguistic theory where metaphors draw connections between abstract concepts like time and concrete concepts like physical location.

Looking up a word’s meaning in a dictionary is a great skill. However, meaning is often more complicated and less clear-cut than a dictionary entry, and describing it requires a range of semantic tools.

2.7 Pragmatics

It is the area of linguistics that puts meaning into context. It is the study of how context contributes to meaning. In other words, it is the study of the meanings of words and sentences in a larger social context. Theories of pragmatics go hand-in-hand with theories of semantics, which studies aspects of meaning which are grammatically or lexically encoded.

2.8 Orthography

The development of writing systems is influenced by many things: the structure of the languages they represent, the tools used to produce them, and who is powerful in a given place and time. The set of conventions used for representing a language in writing is called a writing system or orthography.

3. Important Terminologies ​2​

These are some of the terms I encountered while exploring Arabic NLP and found useful for anyone reading this blog post.

3.1 Treebanks

A treebank is a database of syntactic analyses of sentences (here, Arabic sentences).

3.2 CODA (Conventional Orthography for Dialectal Arabic)

Computational Approaches to Modeling Language (CAMeL) researchers, in collaboration with researchers in a number of universities, have developed a CODA (Conventional Orthography for Dialectal Arabic) — a computational “standard” for writing Arabic dialects. The CODA convention is close to Standard Arabic orthography while maintaining some of the unique morphological and lexical features of Dialectal Arabic.

3.3 CODA-Star

Earlier versions of CODA targeted specific Arabic dialects: Egyptian, Palestinian, Tunisian, Algerian, and Gulf. In its most recent iteration, the guidelines for CODA-Star (as in for any dialect) cover 28 city dialects. More information is found here.

3.4 Morphological Analysis

Its target is to produce all possible morphological/POS (Part of Speech) readings of a word out of context.

3.5 Morphological Generation

It is the reverse of morphological analysis. It is the process in which we map from an underlying representation of a word to a surface form (whether orthographic or phonological).

3.6 Morphological Disambiguation

Its target is to tag the word in context. This task for English is referred to as POS tagging since the standard POS tag set, though only comprising 46 tags, completely disambiguates English morphologically. In Arabic, the corresponding tag set may comprise upwards of 330,000 theoretically possible tags, so the task is much harder. Reduced tag sets have been proposed for Arabic, in which certain morphological differences are conflated, making the morphological disambiguation task easier. The term POS tagging is usually used for Arabic with respect to some of the smaller tag sets.

3.7 Transliteration

The process of mapping the orthographic symbols used in one script into another, such as converting Arabizi script to Arabic script.

3.8 Kashida or Kasheeda

Kashida or Kasheeda (Arabic: کشیده‎; “extended”, “stretched”, “lengthened”) is a type of justification in the Arabic language and in some descendant cursive scripts. For instance, الحمد is converted to الحمــــــد.

3.9 Clitic

It is a morpheme that has the syntactic characteristics of a word but shows evidence of being phonologically bound to another word. In this respect, a clitic is distinctly different from an affix, which is phonologically and syntactically part of the word.

3.10 Etymology

It is the branch of linguistics concerned with the history of the forms and meanings of words.

3.11 Statistical Parsing

It is a group of parsing methods within natural language processing. The methods have in common that they associate grammar rules with a probability.

3.12 Finite-State Machine (FSM)

A finite-state machine (FSM) or finite-state automaton (FSA, plural: automata), finite automaton, or simply a state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number of states at any given time. The FSM can change from one state to another in response to some inputs; the change from one state to another is called a transition.

3.13 Finite-State Transducer (FST)

It is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape.
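
To make the idea of an input tape and an output tape more concrete, here is a minimal sketch of a deterministic finite-state transducer in Python. The function name run_fst, the toy alphabet, and the abstract "+PL" plural symbol are my own assumptions for illustration only; real morphological transducers are much larger and are usually built with dedicated toolkits.

```python
import string


def run_fst(transitions, start, finals, symbols):
    """Run a deterministic FST; return the output string, or None if rejected."""
    state, output = start, []
    for sym in symbols:
        if (state, sym) not in transitions:
            return None  # no transition is defined for this input symbol
        state, out = transitions[(state, sym)]
        output.append(out)
    return "".join(output) if state in finals else None


# Toy transducer (hypothetical): state 0 copies lowercase letters unchanged,
# and the abstract plural symbol "+PL" is written to the output tape as "s".
TRANSITIONS = {(0, ch): (0, ch) for ch in string.ascii_lowercase}
TRANSITIONS[(0, "+PL")] = (1, "s")
FINALS = {0, 1}

print(run_fst(TRANSITIONS, 0, FINALS, list("cat") + ["+PL"]))  # cats
```

Read in this direction, the transducer performs a tiny piece of morphological generation (3.5): it maps an underlying representation to a surface form.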

3.14 Allophone

In phonology, an allophone is one of a set of multiple possible spoken sounds, or phones, or signs used to pronounce a single phoneme in a particular language.

3.15 BLEU Score

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.

3.16 BERT

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. It uses only the encoder part of the Transformer architecture.

3.17 Controlled Experiment

It is a scientific test done under controlled conditions, meaning that just one (or a few) factors are changed at a time, while all others are kept constant.

3.18 ECAL

The Egyptian Colloquial Arabic Lexicon.


The morphological analyzer for Egyptian Arabic (EGY) that uses this lexicon (see section 6.2) is linguistically accurate and large-scale. It follows the part-of-speech (POS) guidelines used by the Linguistic Data Consortium for EGY (Maamouri et al., 2012b). It accepts multiple orthographic variants and normalizes them to CODA.

4. Arabic Linguistic Facts Making Arabic NLP Challenging ​3​, ​4​, ​5​, ​6​

4.1 Morphological Richness

Arabic is a morphologically rich and complex language, so Arabic words have a large number of forms. It employs a combination of templatic, affixational, and cliticization morphological operations to realize a large number of features such as gender, number, person, case, state, aspect, voice, and mood, in addition to a number of attachable clitics such as pronominals, prepositions, conjunctions, negative particles, and future particles.

4.2 Language Variations

The Arabic language now has three important variations that co-exist in daily Arab lives.

4.2.1 Classical Arabic (CA)

This is the Classical Arabic Language that exists in the Holy Quran, and in the Sunnah (Teachings of Prophet Muhammed peace be upon him). It is very powerful, rich, and expressive. However, Arabs do not use it in their daily lives except for reading the Holy Quran, and the Sunnah, and in their prayers. The Arabic language diacritics are an important part of the Classical Arabic for Muslims to properly understand and pronounce the Holy Quran, and Sunnah.

This language has its own lexicons and orthography.

4.2.2 Modern Standard Arabic (MSA)

This is the official, prestigious standard of the media, literature, and education. However, it is mostly not used in Arabs’ daily communication, including on social media.

4.2.3 Dialectal Arabic (DA)

The other variants are the so-called dialects of daily speech and social media. There is a degree of lexical variety reflecting regions with possibly different dialects. As such, it is possible that some MSA words may be more similar to the dialect of a certain region and different from another. Moreover, each of the dialects could slightly change between cities within the same country.

Arabic dialects are often classified regionally (such as Egyptian, Levantine, Gulf, etc.) or subregionally (e.g., Lebanese, Syrian, Jordanian, etc.). These classifications are generally problematic because of the continuous nature of language variation.

Although DAs are historically related to MSA, there are many phonological, morphological and lexical differences between them. Unlike CA and MSA, DAs have no standard orthographies or language academies.

Most tools and resources developed for natural language processing (NLP) of Arabic are designed for MSA. Such resources are quite limited when it comes to processing DA. DAs mostly differ from MSA phonologically, morphologically, and lexically (Gadalla, 2000; Holes, 2004). These differences are not modeled as part of MSA NLP tools, leaving a gap in coverage when using them to process DAs.

4.2.4 Arabizi

Dialectal Arabic text that appears on social media in a non-standard romanization.

From here on, everything I discuss applies to MSA, DA, and Arabizi, because CA is not currently used in education, media, books, or Arabs’ daily communication.

4.3 Orthographic Ambiguity

Arabic is commonly written with optional diacritical marks – which are often omitted – leading to rampant ambiguity. These diacritical marks are for short vowels and consonantal gemination. The missing diacritics are not a major challenge to literate native adults. However, their absence is the main source of ambiguity in Arabic NLP.

4.4 Orthographic Inconsistency (Orthographic Noise)

Noise in written text is a common problem for NLP when working in social media and non-edited text. For MSA, Zaghouani et al. (2014) report that 32% of words in MSA comments online have spelling errors. Eskander et al. (2013) also report close to 24% of Egyptian Arabic words having non-CODA-compliant spelling. Dialectal Arabic text is also known to appear on social media in a non-standard romanization, often called Arabizi (Darwish, 2013).

Moreover, some letters in Arabic are often spelled inconsistently which leads to an increase in both sparsity (multiple forms of the same word) and ambiguity (same form corresponding to multiple words). These errors are so common, that in Arabic NLP, Alif/Ya normalization is standard preprocessing (Habash, 2010), and Alif/Ya specification is done as postprocessing (El Kholy and Habash, 2010).

5. How to Approach Arabic NLP ​7​, ​8​, ​9​

Given all these facts about the Arabic Language, there are special important considerations on approaching Arabic NLP. These are the things that I understood after reading some of Dr. Nizar Habash’s papers.

Text preprocessing, feature engineering, and modeling are among the core stages of the NLP pipeline in general.

5.1 Text Preprocessing

The text-preprocessing techniques listed here are in no particular order; for example, 5.1.1 does not necessarily come before 5.1.4. I am just listing the possible techniques.

5.1.1 Transliteration

When working with Arabic text, it is sometimes convenient or even necessary to use alternate transliteration schemes. Buckwalter transliteration (Buckwalter, 2002) and its variants, Safe Buckwalter and XML Buckwalter, for example, map the Arabic character set into ASCII representations using a one-to-one map. These transliterations are used as input or output formats in Arabic NLP tools such as MADAMIRA (Pasha et al., 2014), and as the chosen encoding of various resources such as the SAMA database (Graff et al., 2009). Habash-Soudi-Buckwalter (HSB) (Habash et al., 2007) is another variant of the Buckwalter transliteration scheme that includes some non-ASCII characters whose pronunciation is easier to remember for non-Arabic speakers.
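
As a rough illustration of the one-to-one mapping idea, here is a minimal Python sketch that covers only a subset of the Buckwalter table. The names AR2BW, BW2AR, to_buckwalter, and from_buckwalter are my own and are not taken from MADAMIRA or any other tool; a real implementation would cover the full character set, including the diacritics.

```python
# Partial Buckwalter table (a subset for illustration, not the full scheme).
AR2BW = {
    "\u0621": "'",  # hamza
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u062A": "t",  # ta
    "\u062B": "v",  # tha
    "\u062D": "H",  # Hah
    "\u062F": "d",  # dal
    "\u0631": "r",  # ra
    "\u0633": "s",  # sin
    "\u0634": "$",  # shin
    "\u0639": "E",  # ayn
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # mim
    "\u0646": "n",  # nun
    "\u0647": "h",  # ha
    "\u0648": "w",  # waw
    "\u064A": "y",  # ya
    "\u0629": "p",  # ta marbuta
}
# The map is one-to-one, so it can simply be inverted.
BW2AR = {bw: ar for ar, bw in AR2BW.items()}


def to_buckwalter(text):
    """Map Arabic letters to ASCII; characters outside the table pass through."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Inverse mapping; only meaningful on text that is pure Buckwalter."""
    return "".join(BW2AR.get(ch, ch) for ch in text)


print(to_buckwalter("\u0643\u062A\u0627\u0628"))  # ktAb
```

Because the table is one-to-one, transliterating to Buckwalter and back returns the original word, which is what makes the scheme convenient as an input or output format.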

5.1.2 Orthographic Normalization

Due to Arabic’s complex morphology, it is necessary to normalize text in various ways in order to reduce noise and sparsity (Habash, 2010). The most common of these normalizations are:

  • Unicode normalization: Which includes breaking up combined sequences (e.g. لا to ل and ا), converting character variant forms to a single canonical form (e.g. ــع, عـ, and ــعــ to ع), and converting extensions to the Arabic character set used for Persian and Urdu to the closest Arabic character (e.g. گ to ك).
  • Dediacritization: Which removes Arabic diacritics which occur infrequently in Arabic text and tend to be considered noise. These include short vowels, shadda (gemination marker), and the dagger alif (e.g. مُدَرِّسَةُ to مدرسة).
  • Removal of unnecessary characters: including those with no phonetic value, such as the tatweel (kashida) character.
  • Letter variant normalization: For letters that are so often misspelled. This includes normalizing all the forms of Hamzated Alif (أ, إ, آ) to bare Alif (ا), the Alif-Maqsura (ى) to Ya (ي), the Ta-Marbuta (ة) to Ha (ه), and the non-Alif forms of Hamza (ؤ, ئ) to the Hamza letter (ء). Alif/Ya normalization is standard preprocessing (Habash, 2010), and Alif/Ya specification is done as postprocessing (El Kholy and Habash, 2010).
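
The normalizations above can be sketched in a few lines of Python using only the standard library. This is a simplified sketch under my own assumptions: the function names are hypothetical, NFKC is used as an approximation of the Unicode normalization step, and the letter map covers only the cases listed above. It is not the implementation used in any particular toolkit.

```python
import re
import unicodedata

# Tanween, short vowels, shadda, sukun (U+064B..U+0652) and dagger alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0649": "\u064A",  # alif maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0624": "\u0621",  # hamza on waw -> hamza
    "\u0626": "\u0621",  # hamza on ya -> hamza
    "\u06AF": "\u0643",  # Persian gaf -> kaf
})


def unicode_normalize(text):
    """Split ligatures and map presentation forms to canonical letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def dediacritize(text):
    return DIACRITICS.sub("", text)


def remove_tatweel(text):
    return text.replace(TATWEEL, "")


def normalize_letters(text):
    return text.translate(LETTER_MAP)


def normalize(text):
    """Apply all four steps in sequence."""
    text = unicode_normalize(text)
    text = dediacritize(text)
    text = remove_tatweel(text)
    return normalize_letters(text)
```

Keeping each step as its own function makes it easy to switch individual normalizations on or off, since not every task benefits from all of them.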

5.1.3 Character-level Models

Character-level models can be used with Arabizi, as shown in this paper, since they capture the complexity of the noise, variations, and mistakes. Because of these complexities and the high number of out-of-vocabulary words, a word-level neural model cannot be used as an end-to-end solution for Arabizi.

5.1.4 Global Lower Casing and Repetition Elision to two characters

This is used with Arabizi to handle common variations in social media text such as inconsistent capitalization and emphatic repetitions.
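
A minimal sketch of these two steps in Python could look like the following. The function name normalize_arabizi is my own, and the regular expression is just one simple way to cap any run of a repeated character at two.

```python
import re

# Any character followed by two or more copies of itself.
_REPEATS = re.compile(r"(.)\1{2,}")


def normalize_arabizi(text):
    """Lowercase the text, then shorten runs of 3+ identical characters to 2."""
    return _REPEATS.sub(r"\1\1", text.lower())


print(normalize_arabizi("mashiiiii ya3ni Leeeeh"))  # mashii ya3ni leeh
```

Keeping two characters instead of one preserves the hint that the writer was emphasizing the word, while still collapsing the many possible spellings into one form.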

5.1.5 Handling Non-Arabic Words (Foreign Words)

English words (or other words foreign to Arabic) are mapped to the output as they are, without any attempt to normalize or modify their spelling. This can be used with Arabizi when the task is oriented towards identifying foreign words or, from a different perspective, detecting Arabizi words. This is not always simple, given that some Arabizi words are ambiguous with English words.

Accented characters can be converted to their unaccented versions in the standard 7-bit ASCII.
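
One possible way to do this with Python's standard unicodedata module is sketched below. The function name strip_latin_accents is hypothetical, and the choice to leave every non-Latin character (including Arabic letters) untouched is my own assumption about what is wanted here.

```python
import unicodedata


def strip_latin_accents(text):
    """Replace accented Latin characters with their plain ASCII base letters."""
    out = []
    # Compose first so that "e" + combining accent is handled as one character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Only substitute when the result is pure ASCII; otherwise keep the
        # original character (for example, Arabic letters stay as they are).
        out.append(base if base and base.isascii() else ch)
    return "".join(out)


print(strip_latin_accents("caf\u00e9"))  # cafe
```

Characters that have no ASCII decomposition are kept unchanged rather than dropped, so nothing disappears silently from the input.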

5.1.6 Handling emojis and emoticons

Emojis and emoticons could be handled in a way comparable to foreign words. For a small set of commonly used emoticons, however, a dictionary could be used to preprocess them into the (#) symbol.

5.1.7 Other Preprocessing Steps
  • Remove invalid characters
  • Normalize white spaces
  • Remove lines without any Arabic characters
  • Sentence-level Tokenization (split each line into sentences)
  • Word-level Tokenization (split on white spaces and punctuations)
  • Remove URLs
  • For (Tweets) remove Twitter usernames
  • For (Poem Verses) separate the halves of each verse by using the [SEP] token if you are using BERT
  • Morphological analysis has been shown to improve the performance of DID (Dialect Identification) systems for a small number of dialects (Darwish et al., 2014). However, the morphological analysis and segmentation tools for DA are very limited in number and sophistication (Pasha et al., 2014), cover only a small number of dialects (Habash and Rambow, 2006; Habash et al., 2012b; Khalifa et al., 2017), and are unavailable for most of the others.
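
Several of the simpler steps in this list can be sketched with regular expressions. The patterns and function names below are my own simplifications (for example, the URL pattern is deliberately loose and the punctuation set is small), so treat this as a starting point rather than a complete cleaner.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
USERNAME = re.compile(r"@\w+")
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")
# Words, or single punctuation marks (including Arabic comma, semicolon,
# and question mark).
TOKEN = re.compile(r"[^\s.,!?;:\u060C\u061B\u061F]+|[.,!?;:\u060C\u061B\u061F]")


def clean_line(line):
    """Remove URLs and Twitter usernames, then normalize white space."""
    line = URL.sub(" ", line)
    line = USERNAME.sub(" ", line)
    return " ".join(line.split())


def preprocess(lines):
    """Clean every line and drop lines without any Arabic characters."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if ARABIC_CHAR.search(line)]


def tokenize(line):
    """Word-level tokenization: split on white space and punctuation."""
    return TOKEN.findall(line)
```

For example, preprocess applied to a list of tweets returns only the cleaned lines that still contain Arabic text, and tokenize can then be applied to each of them.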

5.2 Feature Engineering ​10​

5.2.1 Word n-grams Features

Word unigrams are extensively used in text classification tasks. They are useful for Dialect Identification (DID) tasks as they depict words unique to some dialects. Moreover, lexical variations are prominent and could be predictive for certain regions, countries, and cities.

5.2.2 Character n-grams Features

Character n-grams have been shown to be the most effective features in language and dialect identification tasks (Zampieri et al., 2017). For DAs, character n-grams are good at capturing several morphological features that are distinctive between Arabic dialects, especially the use of clitics and affixes.
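
As a small illustration, character n-gram counts can be extracted with a few lines of Python. The function name char_ngrams and the default range of 1 to 5 are my own choices for this sketch; in practice the counts would usually be weighted (for example with Tf-Idf) before being given to a classifier.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count all character n-grams of length n_min..n_max in the text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


print(char_ngrams("abab", 1, 2))
```

Because prefixes and suffixes show up as short n-grams at word edges, features like these can pick up clitics and affixes without any morphological analyzer.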

5.2.3 Language Model (LM) Probability Scores for Dialect Identification (DID)

You could train n LMs, one for each of the n dialects, on the corpora you have. Then you could score each sentence in your data with these LMs and use the resulting probability scores as features. Thus, each sentence will have n probability scores, one for each dialect. Each score measures how close the sentence is to the corresponding dialect.
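
The idea can be sketched with a very small character bigram language model. Everything in this sketch is an assumption of mine made for illustration: the class name CharBigramLM, the bigram order, the add-one smoothing, and the toy Latin-script "dialects". A real system would use a proper LM toolkit and much more data.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram LM with add-one smoothing (illustrative only)."""

    def __init__(self, sentences, vocab):
        self.vocab_size = len(vocab) + 2  # plus the start and end markers
        self.bigrams = Counter()
        self.contexts = Counter()
        for sentence in sentences:
            chars = ["<s>"] + list(sentence) + ["</s>"]
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def score(self, sentence):
        """Average log-probability per character transition."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        logp = 0.0
        for prev, cur in zip(chars, chars[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.vocab_size
            logp += math.log(num / den)
        return logp / (len(chars) - 1)


def lm_features(sentence, lms):
    """One score per dialect, in a fixed (sorted) dialect order."""
    return [lms[dialect].score(sentence) for dialect in sorted(lms)]


# Toy corpora standing in for two dialects (hypothetical data).
corpora = {"A": ["aab aba", "abba"], "B": ["zzy yzz", "zyzy"]}
vocab = {ch for sents in corpora.values() for s in sents for ch in s}
lms = {d: CharBigramLM(sents, vocab) for d, sents in corpora.items()}
features = lm_features("abab", lms)  # [score under A, score under B]
```

The resulting feature vector has one entry per dialect, and the entry for the closest dialect is the largest (least negative) one.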

5.3 Modeling

5.3.1 Neural Models

Deep learning models, specifically sequence-to-sequence (Seq2Seq) RNN models, have shown a lot of success in the task of character-based transliteration for a number of languages (Rosca and Breuel, 2016; Kundu et al., 2018; Dershowitz and Terner, 2020).

The noise and variability can be mitigated by modeling the Arabizi-to-Arabic transliteration as a character-level sequence-to-sequence process.

5.3.2 BERT

Arabic has benefited from extensive efforts in building dedicated pre-trained language models, achieving state-of-the-art results in a number of NLP tasks, across both Modern Standard Arabic (MSA) and Dialectal Arabic (DA) (Antoun et al., 2020; Abdul-Mageed et al., 2020a).

5.3.3 Multinomial Naive Bayes (MNB) classifier

MNB is a variation of Naive Bayes that estimates the conditional probability of a token given its class as the relative frequency of the token t in all documents belonging to class c. MNB has proven to be suitable for classification tasks with discrete features (e.g., word or character counts or representation for text classification) (Manning et al., 2008).
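
To show how the relative-frequency estimate turns into a classifier, here is a minimal pure-Python sketch. The class name, the add-one (Laplace) smoothing, and the toy romanized examples are my own assumptions added for illustration; smoothing is included only so that unseen token and class pairs do not get zero probability.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token counts, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def log_prob(self, tokens, label):
        """log P(c) + sum over tokens of log P(t | c)."""
        total = sum(self.token_counts[label].values())
        n_docs = sum(self.class_counts.values())
        logp = math.log(self.class_counts[label] / n_docs)
        for t in tokens:
            if t in self.vocab:  # tokens never seen in training are skipped
                count = self.token_counts[label][t]
                logp += math.log((count + 1) / (total + len(self.vocab)))
        return logp

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_prob(tokens, c))


# Toy training data (hypothetical, romanized for readability).
docs = [["ezayak", "3amel", "eh"], ["shlonak", "shakhbarak"],
        ["ezayak", "ya", "basha"], ["shlonak", "zain"]]
labels = ["EGY", "GLF", "EGY", "GLF"]
model = MultinomialNB().fit(docs, labels)
print(model.predict(["ezayak"]))  # EGY
```

The same class works unchanged with character n-grams as tokens instead of words.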

5.3.4 Character-based Language Model (LM)

Character-based LMs leverage subword information and are generally good for capturing particular peculiarities that are specific to certain dialects such as the use of certain clitics, affixes, or internal base word structure.

Furthermore, character-level models mitigate the ineffectiveness of word-based LMs caused by the presence of out-of-vocabulary words (OOVs) that are prominent in dialects, due to the lack of standard orthography (Habash et al., 2012a).

6. Notes from Dr. Nizar Habash’s Papers

These are some of the notes I took while reading Dr. Nizar Habash’s papers. Some of them need to be read in the context of the paper to be properly understood. However, I have put them here out of context, as they appear in the papers, to avoid overloading the blog post; the reader can look up each part in the paper itself. Reading these notes will give you some insight into Arabic NLP.

6.1 Automatic Gender Identification and Reinflection in Arabic ​5​

  • The Rule-based model also has very high precision, comparable to that of the Joint model; but it trades off with the lowest recall. This is expected and typical of rule-based models.
  • The main reason for this setup is that character-level representations are reported to be good in capturing and learning morphological aspects (Ling et al., 2015; Kim et al., 2016), which is important for a morphologically rich language like Arabic. Furthermore, character-level NMT modeling requires less vocabulary and helps reduce out-of-vocabulary by translating unseen words.
  • The BLEU scores are very high because most of the words are not changed between input and reference.
  • There were also a few cases of very long repetitions in the output; as well as reduced output – simply leading to sentence length mismatch. All of these phenomena are unsurprising side effects of using character-based NMT models.

6.2 A Morphological Analyzer for Egyptian Arabic ​6​

  • Another important morphological difference from MSA is that DAs in general and not just EGY drop the case and mood features almost completely.
  • However, to handle input in a variety of spellings, we extend our analyzer to accept non-CODA-compliant word forms but map them only to CODA-compliant forms as part of the analysis.
  • We did not attempt to correct or change the ECAL letter spelling; we only added diacritics.
  • The rules are specified in a simple format that is interpreted and applied by a separate rule processing script. Developing the script and writing the rules took about 3 person-months of effort.
  • Overall, these are positive results that suggest the next steps should involve additional orthographic and morphological extensions and paradigm completion.

6.3 Fine-Grained Arabic Dialect Identification ​10​

  • For instance, several evaluation campaigns were dedicated to discriminating between language varieties (Malmasi et al., 2016; Zampieri et al., 2017). This is not surprising considering the importance of automatic DID for several NLP tasks, where prior knowledge about the dialect of an input text can be helpful, such as machine translation (Salloum et al., 2014), sentiment analysis (Al-Twairesh et al., 2016), or author profiling (Sadat et al., 2014).
  • High-order character n-grams extracted from speech or phonetic transcripts and i-vectors (a low dimensional representation of audio recordings) were shown to be the most successful and efficient features (Butnaru and Ionescu, 2018), while deep learning approaches (Belinkov and Glass, 2016) did not perform well.
  • Arabic speakers usually resort to repeated code-switching between their dialect and MSA (Abu-Melhim, 1991; Bassiouney, 2009), creating sentences with different levels/percentages of dialectness.
  • Thus, the high degree of similarity among some dialects shows that discriminating between dialects on the word-level is rather challenging. This can affect the accuracy of our models due to the increase of confusability among similar dialects.
  • The latter two had lower accuracy than the simple Language Model baseline, which could be explained by the small size of our training data.
  • We use Term Frequency-Inverse Document Frequency (Tf-Idf) scores as it has been shown to empirically outperform count weights.
  • The scores are calculated per class for our best system, which can provide a better understanding on the confusability of the classes and sensitivity of our model.
  • We hypothesize that the morphological features in the words’ structure are well captured within character LMs.
  • Still, the use of word unigram features alone (row g.) is able to beat 1-to-5-gram character features (row f.).
  • Interestingly, on average, it takes 1.52 sentences to predict the right class of a given sentence. With more examples of text by a writer, our system can confidently determine the correct dialect of the writer with a high accuracy reaching 100%.
  • We showed that using our best model, we can identify the exact city of a speaker at an accuracy of 67.9% for sentences with an average length of 7 words (a 9% relative error reduction over the state-of-the-art technique for Arabic dialect identification) and reach more than 90% when we consider 16 words.

6.4 The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models ​9​

  • Of course, none of the datasets was purely MSA or DA; however, based on known dataset variants, we observe that having about 40% or fewer MSA labels strongly suggests that the dataset is dialectal (or a strong dialectal mix).
  • These observations suggest that the size of pretraining data has limited and inconsistent effect on the fine-tuning performance. This is consistent with Micheli et al. (2020), where they concluded that pre-training data size does not show a strong monotonic relationship with fine-tuning performance in their controlled experiments on French corpora.
  • NER is the most sensitive to the pre-trained model variant (16.2%), followed by sentiment analysis (8.2%), dialect identification (3.8%), poetry classification (1.3%), and POS tagging (1.3%). This indicates the importance of optimal pairing of pre-trained models and fine-tuning tasks.
  • In all the cases except Gumar (11 out of 12), we obtain the best performance where the model has the lowest OOV rate.
  • This suggests that having a wide language variety in pre-training data can be beneficial for DA subtasks, whereas variant proximity of pre-training data to fine-tuning data is important for MSA and CA subtasks.
  • The best model on average is AraBERTv02 (X3); it wins or ties for a win in six out of 12 subtasks (four MSA and two DA). Our CAMeLBERT-Star is second overall on average, and it wins or ties for a win in five out of 12 subtasks (three DA, one MSA, one CA). Interestingly, the two systems are complementary in their performance and between the two they win or tie for a win in 10 out of 12 subtasks. The two remaining subtasks are uniquely won by MARBERT (X7) (NADI, DA), and ARBERT (X8) (ANERcorp, MSA). In practice, such complementarity can be exploited by system developers to achieve higher overall performance.
  • Our results show that the variant proximity of pre-training data and subtask data is more important than pre-training data size.
  • We outline here such a setup: the user has access to three versions of the models: CAMeLBERT’s CA, MSA, and Mix. If the task data is known a priori to be CA, then we select the CAMeLBERT-CA model; if the task data is known to be MSA, we select the CAMeLBERT-MSA model; otherwise, we use the CAMeLBERT-Mix model (i.e., for dialects).
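
The selection setup in the last note is simple enough to write down as a small function. This is only a sketch of the decision rule: the function name select_camelbert is hypothetical, and the returned strings are the model names as written in the note, not exact identifiers from any model hub.

```python
def select_camelbert(variant=None):
    """Pick a model by the known variant of the task data.

    "CA" and "MSA" select the matching model; anything else (including an
    unknown variant or dialectal data) falls back to the Mix model.
    """
    models = {"CA": "CAMeLBERT-CA", "MSA": "CAMeLBERT-MSA"}
    return models.get(variant, "CAMeLBERT-Mix")


print(select_camelbert("MSA"))  # CAMeLBERT-MSA
print(select_camelbert("DA"))   # CAMeLBERT-Mix
```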


Thank you. I hope this post has been beneficial to you. I would appreciate any comments if anyone needed more clarifications or if anyone has seen something wrong in what I have written in order to modify it, and I would also appreciate any possible enhancements or suggestions. We are humans, and errors are expected from us, but we could also minimize those errors by learning from our mistakes and by seeking to improve what we do.

Allah bless our master Muhammad and his family.


Bibliographical References

  1. Bender EM. Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Vol 6. Morgan & Claypool Publishers; 2013:1–184.
  2. Habash NY. Introduction to Arabic Natural Language Processing. Vol 3. Morgan & Claypool Publishers; 2010:1–187.
  3. Al Khalil M, Habash N, Jiang Z. A Large-Scale Leveled Readability Lexicon for Standard Arabic. In: Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association; 2020:3053–3062.
  4. Habash N, Eryani F, Khalifa S, et al. Unified Guidelines and Resources for Arabic Dialect Orthography. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA); 2018.
  5. Habash N, Bouamor H, Chung C. Automatic Gender Identification and Reinflection in Arabic. In: Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics; 2019:155–165. doi:10.18653/v1/W19-3822
  6. Habash N, Eskander R, Hawwari A. A Morphological Analyzer for Egyptian Arabic. In: Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology. Association for Computational Linguistics; 2012:1–9.
  7. Obeid O, Zalmout N, Khalifa S, et al. CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing. In: Proceedings of the 12th Language Resources and Evaluation Conference; 2020:7022–7032.
  8. Shazal A, Usman A, Habash N. A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models. In: Proceedings of the Fifth Arabic Natural Language Processing Workshop. Association for Computational Linguistics; 2020:167–177.
  9. Inoue G, Alhafni B, Baimukan N, Bouamor H, Habash N. The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models. arXiv preprint arXiv:2103.06678. Published online 2021.
  10. Salameh M, Bouamor H, Habash N. Fine-Grained Arabic Dialect Identification. In: Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics; 2018:1332–1344.