Digital Islamic Humanities

Noor intelligent Arabic word stemmer engine

Document Type: Original Article

Author
Programming engineer, Intelligent Processing Group, Computer Research Center of Islamic Sciences
Abstract
A stemmer is a morphological-analysis algorithm used in Natural Language Processing (NLP) to extract and represent the root or base form of a word. It does so by removing prefixes, suffixes, and vowels from conjugated verbs, derived nouns, and other word forms. An Arabic stemmer aims to reduce each word to its base form while preserving its semantic identity and syntactic function, which enables efficient indexing, searching, and categorization of large Arabic text corpora. Stemming plays a vital role in information retrieval, document classification, machine translation, summarization, and question-answering systems.
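The affix-removal idea described above can be sketched in a few lines of Python. This is a minimal illustration of light stemming in general, not the Noor engine: the affix lists, the `min_len` threshold, and the function names `strip_diacritics` and `light_stem` are all assumptions made for the example.

```python
# Illustrative light stemmer: removes vowel marks, then at most one prefix
# and one suffix. The affix lists are a small hypothetical sample and are
# NOT the rules used by the Noor engine.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]  # longer prefixes first
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي"]


def strip_diacritics(word: str) -> str:
    """Drop Arabic diacritics in the range U+064B..U+0652
    (tanwin, short vowels, shadda, sukun)."""
    return "".join(ch for ch in word if not ("\u064b" <= ch <= "\u0652"))


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip one prefix and one suffix, keeping at least min_len letters."""
    stem = strip_diacritics(word)
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[: len(stem) - len(suffix)]
            break
    return stem


if __name__ == "__main__":
    for w in ["الكتاب", "والمعلمون", "الْكِتَابُ"]:
        print(w, "->", light_stem(w))
```

A purely rule-based sketch like this one has no way to tell a true prefix from a letter that belongs to the word itself, which is the kind of ambiguity the abstract goes on to discuss.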
A distinctive feature of the Noor intelligent Arabic stemmer is that it combines rule-based, data-driven, and learning-based methods. As a result, its performance significantly surpasses that of other stemming engines. The engine draws on Arabic text data from both classical and modern sources, together with intelligent morphological analysis and rule-based refinement techniques grounded in the structured nature of the Arabic language. Integrating these three methods is highly effective at resolving ambiguities in words that do not conform neatly to the rules of Arabic, which has allowed the engine to overcome some of the challenges faced by other stemmers.
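One way to picture how the three kinds of evidence could interact is the toy sketch below. It is an assumption-laden illustration, not the Noor engine's design: rules enumerate candidate stems, a small lexicon (standing in for corpus data) filters them, and frequency counts (standing in for a learned scoring model) choose among the survivors. The names `candidates`, `hybrid_stem`, `LEXICON`, and `STEM_COUNTS`, and all the data in them, are hypothetical.

```python
from collections import Counter

# Hypothetical affix rules; the empty string means "no affix".
PREFIXES = ["", "و", "ب", "ال", "وال", "بال"]
SUFFIXES = ["", "ة", "ات", "ون", "ين", "ها"]

# Hypothetical toy resources standing in for corpus-derived data
# and for a trained scoring model.
LEXICON = {"كتاب", "معلم", "ولد", "باب"}
STEM_COUNTS = Counter({"كتاب": 120, "ولد": 80, "باب": 60, "معلم": 45})


def candidates(word: str) -> set[str]:
    """Rule-based step: every prefix/suffix split the affix lists allow,
    keeping at least two letters in the remaining stem."""
    found = set()
    for p in PREFIXES:
        for s in SUFFIXES:
            rest = len(word) - len(p) - len(s)
            if word.startswith(p) and word.endswith(s) and rest >= 2:
                found.add(word[len(p): len(word) - len(s)])
    return found


def hybrid_stem(word: str) -> str:
    cands = sorted(candidates(word))  # sorted for deterministic tie-breaking
    if not cands:
        return word
    known = [c for c in cands if c in LEXICON]  # data-driven filter
    if known:
        # Frequency is a stand-in for a learned scorer.
        return max(known, key=lambda c: STEM_COUNTS[c])
    return min(cands, key=len)  # fallback: most aggressive rule-based split


if __name__ == "__main__":
    for w in ["ولد", "والكتاب", "باب", "الحاسوب"]:
        print(w, "->", hybrid_stem(w))
```

In this sketch the word ولد keeps its initial و because the lexicon recognizes the whole word, whereas a rule-only stripper would treat that letter as a conjunction and remove it.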
Furthermore, the engine has been evaluated against stems produced by human experts, and its accuracy is very high compared with that of other stemmers. The engine can also return stems at different levels, a new and highly beneficial capability that can serve the needs of a wide range of users.