NEMLAR Corpus

01-05-2018

Version 1.0

Abstract

The first version of the NEMLAR corpus was produced within the NEMLAR project. It is a set of annotated Arabic texts collected from 13 different domains, and it contains about 500,000 words.

The Arabic Language Processing team (ALP team) of Mohammed First University in Morocco enriched this corpus by adding a lemma label to every word in the corpus. The team also corrected some annotation errors in the first version. The new version can be downloaded from the ALP team website.

This new version is in XML format. Each word is accompanied by the following tags:

  • Vowelized form
  • Lemma
  • POS tag
  • Clitics attached to the stem
  • Root
  • Pattern
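As a rough illustration of how such per-word annotations might be read, here is a minimal Python sketch. The element name `word`, the attribute names, the POS values and the sample sentence are all assumptions made for this example. They are not the corpus's documented schema, so check them against the actual files before use.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass

# Hypothetical sample: element and attribute names are assumptions,
# not the schema actually used by the corpus files.
SAMPLE = """<corpus>
  <sentence>
    <word vowelized="كَتَبَ" lemma="كَتَبَ" pos="verb" clitics="" root="كتب" pattern="فَعَلَ"/>
    <word vowelized="الكِتَابَ" lemma="كِتَاب" pos="noun" clitics="ال" root="كتب" pattern="فِعَال"/>
  </sentence>
</corpus>"""


@dataclass
class Token:
    """One annotated word carrying the six tags listed above."""
    vowelized: str
    lemma: str
    pos: str
    clitics: str
    root: str
    pattern: str


def read_tokens(xml_text: str) -> list[Token]:
    """Collect every <word> element (assumed name) into a Token."""
    tree_root = ET.fromstring(xml_text)
    fields = ("vowelized", "lemma", "pos", "clitics", "root", "pattern")
    tokens = []
    for word in tree_root.iter("word"):
        # Missing attributes default to an empty string.
        tokens.append(Token(**{name: word.get(name, "") for name in fields}))
    return tokens


tokens = read_tokens(SAMPLE)
pos_counts = Counter(token.pos for token in tokens)
```

With the real files, only the names passed to `iter` and `get` should need to change, provided the tags are stored as attributes. If they are stored as child elements instead, `word.findtext(name, "")` would be the closer equivalent.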


  • Attribution (BY) — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial (NC) — You may not use the material for commercial purposes.
  • ShareAlike (SA) — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

AlKhalil Morpho Sys 2

19-03-2016

Version 2.1

Abstract

AlKhalil Morpho Sys is a morphosyntactic analyzer of standard Arabic words taken out of context. The system analyzes non-diacritized words as well as partially or totally diacritized ones. In this paper, we present the second version of this analyzer. Correcting errors in the database of the first version and enriching this database with missing data allowed us to develop a more accurate version with very high coverage: the percentage of analyzed words exceeds 99%. In addition, we have enriched the morphological features provided by this new version with the lemma of the word and its pattern, which are very useful in many applications of Arabic language processing. Furthermore, thanks to the new organization of the database and the improvements made to the source code, this new version performs analysis very quickly.


  • Attribution (BY) — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial (NC) — You may not use the material for commercial purposes.
  • ShareAlike (SA) — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.