Formation of text corpus and frequency definition for the words in the Arabic language: problems and solutions


  • Олег Иванович Редькин, St. Petersburg State University, 7-9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation


Although corpus building for Indo-European languages, including Russian, is comparatively well developed relative to other languages, Arabic in particular, the problem is still far from a final solution. The article addresses the problems of building an Arabic corpus from Internet material and other available sources, proposes solutions, and identifies the principles of data selection. It also considers the results of compiling a frequency dictionary of Arabic, as well as peculiarities of Arabic phonology, morphology and script, and examines some peculiarities of stress in Arabic. The article includes a list of the most common Arabic words with their frequency indices. Refs 6. Tables 1.


Arabic, corpus, computer, data, processing, frequency, dictionary





AbdelRaouf A., Higgins C. A., Pridmore T., Khalil M. Building a multi-modal Arabic corpus (MMAC) // International Journal on Document Analysis and Recognition. 2010. Vol. 13, N 4 (Dec., 2010). P. 285–302.

Haslina H., Mat D. N., Atwell E. S. Connectives in the World Wide Web Arabic Corpus // World Applied Sciences Journal. 2013. Vol. 21 (Special Issue of Studies in Language Teaching and Learning). P. 67–72.

Kilgarriff A., Rundell M., Dhonnchadha E. U. Efficient corpus development for lexicography: building the New Corpus for Ireland // Language Resources and Evaluation. 2006. Vol. 40, N 2 (May, 2006). P. 127–152.

Mansour M. A. The Absence of Arabic Corpus Linguistics: A Call for Creating an Arabic National Corpus // International Journal of Humanities and Social Science. 2013. Vol. 3, N 12 (Special Issue, June 2013). P. 83–84.

Hammo B., Abuleil S., Lytinen S., Evens M. Experimenting with a Question Answering System for the Arabic Language // Computers and the Humanities. 2004. Vol. 38, N 4 (Nov., 2004). P. 397–415.

Ferguson Ch. Diglossia // Word. 1959. Vol. 15, N 2. P. 325–340.



How to Cite

Редькин, О. И. (2014). Formation of text corpus and frequency definition for the words in the Arabic language: problems and solutions. Vestnik of Saint Petersburg University. Asian and African Studies, (1), 14–22. Retrieved from


