Peculiarities of the Arabic Language Processing: Morphological Modeling

Authors

  • Olga A. Bernikova, St. Petersburg State University
  • Natalia A. Kizhaeva, St. Petersburg State University

DOI:

https://doi.org/10.21638/spbu13.2023.302

Abstract

The paper examines morphological modeling of Arabic, starting from the features of the language that complicate its formalization. Morphological modeling is one of the key stages of automatic text analysis; it covers tools for reducing a word form to its stem or root, identifying its part of speech, automatically generating a given word form, etc. The objectives of the study are interdisciplinary: they include the theoretical study of those features of Arabic that matter most for its automatic processing, as well as a survey of existing morphological analyzers and the specifics of how they work. The practical part is based on testing CAMeL Tools, a toolkit whose comprehensive nature is one of its advantages: it supports both text preprocessing and applied tasks, including sentiment analysis. Test examples were selected to reflect the features of Arabic that are hard to formalize (segmentation of function words written together with the adjacent word, morphological and lexical homonymy, etc.). The study also takes into account the variability of the umbrella concept of "the Arabic language", which combines Classical Arabic, Modern Standard Arabic, and the modern Arabic dialects. Testing the morphological modeling tools points to a need to refine the terminological apparatus, which varies across the descriptions of word forms. This kind of variation (divergence from the concepts accepted in general linguistics) can distort the results of lexico-semantic analysis. The analysis also revealed some gaps in part-of-speech assignment, the description of word forms, etc. The results are relevant both for linguistic research and for improving software applications that process Arabic text.
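
The segmentation and homonymy difficulties named in the abstract can be illustrated with a minimal sketch. It is not the CAMeL Tools API and not the authors' test procedure: the proclitic list, the stem lexicon, and the function name `segment` are hypothetical, the data is a handful of forms in Buckwalter-style transliteration, and no constraint on clitic order is enforced. It only shows why a token with attached function words can receive more than one analysis.

```python
# Toy illustration only. PROCLITICS and STEMS are hypothetical data in
# Buckwalter-style transliteration, far smaller than a real analyzer's lexicon.
PROCLITICS = ["w", "b", "l", "Al"]  # 'and', 'with/by', 'for/to', definite article
STEMS = {"ktAb": "book", "whm": "illusion", "hm": "they", "byt": "house"}


def segment(token, prefix=()):
    """Return every split of `token` into known proclitics plus a known stem."""
    results = []
    # Reading 1: the remaining string is itself a stem.
    if token in STEMS:
        results.append(prefix + (token,))
    # Reading 2: peel off one proclitic and analyze the rest recursively.
    for clitic in PROCLITICS:
        if token.startswith(clitic) and len(token) > len(clitic):
            results.extend(segment(token[len(clitic):], prefix + (clitic,)))
    return results


if __name__ == "__main__":
    for word in ["wbAlktAb", "whm", "byt"]:
        print(word, segment(word))
```

In this toy lexicon `wbAlktAb` has a single segmentation (`w + b + Al + ktAb`, "and with the book"), while `whm` has two: the noun "illusion" and `w + hm`, "and they". A real analyzer has to rank such competing readings in context, which this sketch does not attempt.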

Keywords:

Arabic language, morphological modeling, analyzer, processing

Published

2023-12-05

How to Cite

Bernikova, O. A., & Kizhaeva, N. A. (2023). Peculiarities of the Arabic Language Processing: Morphological Modeling. Vestnik of Saint Petersburg University. Asian and African Studies, 15(3), 459–484. https://doi.org/10.21638/spbu13.2023.302