The first version of the Nemlar corpus was produced within the NEMLAR project. It is a set of annotated Arabic texts collected from 13 different domains and contains about 500,000 words.
The Arabic Language Processing team (ALP team) of Mohammed First University in Morocco enriched this corpus by adding a lemma label to every word. The team also corrected some annotation errors in the first version. The new version is downloadable from the ALP team website.
This new version is in XML format. Each word is accompanied by the following tags:
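A minimal sketch of reading per-word annotations from an XML file of this kind with Python's standard library is given here. The element and attribute names (`word`, `form`, `lemma`) and the sample content are placeholders chosen for illustration, not the corpus's actual schema; substitute the names used in the downloaded files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the element and attribute names are placeholders,
# not the actual schema of the corpus.
SAMPLE = """<corpus>
  <sentence>
    <word form="كتب" lemma="كتب"/>
    <word form="الكتاب" lemma="كتاب"/>
    <word form="كتاب" lemma="كتاب"/>
  </sentence>
</corpus>"""


def lemma_counts(xml_text, word_tag="word", lemma_attr="lemma"):
    """Count lemma occurrences over all word elements in an XML string."""
    root = ET.fromstring(xml_text)
    return Counter(
        w.get(lemma_attr)
        for w in root.iter(word_tag)
        if w.get(lemma_attr)  # skip words that carry no lemma attribute
    )


counts = lemma_counts(SAMPLE)
for lemma, n in counts.most_common():
    print(lemma, n)
```

Passing the tag and attribute names as parameters keeps the function usable once the real names are known. For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.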
AlKhalil Morpho Sys is a morphosyntactic analyzer of Standard Arabic words taken out of context. The system analyzes non-diacritized words as well as partially or totally diacritized ones. In this paper, we present the second version of this analyzer. Correcting the errors in the database of the first version and enriching this database with missing data allowed us to develop a more accurate version with very high coverage: more than 99% of words are analyzed. In addition, we have enriched the morphological features provided by this new version with the lemma of the word and its pattern, which are very useful in many applications of Arabic language processing. Furthermore, thanks to the new organization of the database and the improvements made to the source code, this new version produces analyses very quickly.
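Coverage here is the share of input words for which the analyzer returns at least one analysis. The following is a minimal sketch of that computation, not the authors' evaluation code. The names `coverage`, `TOY_LEXICON`, and `toy_analyze` are hypothetical, and the toy lookup table merely stands in for a real analyzer.

```python
def coverage(words, analyze):
    """Return the percentage of words for which analyze() yields at least one analysis."""
    if not words:
        return 0.0
    analyzed = sum(1 for w in words if analyze(w))
    return 100.0 * analyzed / len(words)


# Hypothetical stand-in for an analyzer: a lookup table from word to analyses.
TOY_LEXICON = {"kitab": ["noun"], "kataba": ["verb"], "fi": ["particle"]}


def toy_analyze(word):
    """Return the list of analyses for a word, or an empty list if unknown."""
    return TOY_LEXICON.get(word, [])


result = coverage(["kitab", "kataba", "fi", "xyz"], toy_analyze)
print(f"coverage: {result:.1f}%")
```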