Applying and Adjusting Zipf's Law and the Zou Statistical Model to Recognize Stop Words in Persian, Using a Corpus of Scientific Research Articles in Library and Information Science

Document Type : Research Article

Authors

Abstract

Purpose: The aim of this research was to recognize and extract a systematic list of stop words for use in the automatic indexing of Persian texts in the field of Library and Information Science.

Method: We used content analysis. The research population comprised 56 articles, of which 20 were selected by simple random sampling.

Findings: Of the 15557 words in the corpus, the Zou model recognized 1368 words as stop words in the pre-adjustment list and 468 in the post-adjustment list. Zipf's law recognized 217 stop words in the pre-adjustment list and 607 in the post-adjustment list. The abstracts of the articles contained 1989 words in total. From these, the Zou model extracted 148 stop words in the pre-adjustment list and 173 in the post-adjustment list, while Zipf's law recognized 60 in the pre-adjustment list and 186 in the post-adjustment list.
In both methods, there was a direct relation between a word's frequency and its probability of being a stop word. The highest percentage of stop words (39.44 percent) was obtained from the full texts of the articles using the Zou statistical model. The results of this research can increase the efficiency of information storage and retrieval, reduce the volume of input, and save time and cost.
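The relation between word frequency and stop-word status can be illustrated with a minimal Python sketch of a Zipf-style rank-frequency table. This is not the procedure used in the study: the function names, the whitespace tokenization, and the fixed `top_n` cutoff are assumptions made for illustration, and neither the Zou model nor the adjustment step is reproduced here.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first.

    Under Zipf's law, rank * frequency stays roughly constant across a
    large corpus; on a tiny sample it will not.
    """
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


def stop_word_candidates(tokens, top_n):
    """Return the top_n most frequent words as stop-word candidates.

    The cutoff top_n is a hypothetical parameter, not a threshold
    taken from the study.
    """
    return [word for _, word, _, _ in rank_frequency(tokens)[:top_n]]


if __name__ == "__main__":
    # Hypothetical sample text; whitespace splitting is a simplification
    # and would need refinement for real Persian text.
    sample = "the cat and the dog and the bird".split()
    for row in rank_frequency(sample):
        print(row)
    print(stop_word_candidates(sample, 2))
```

A frequency cutoff alone only produces candidates. A list built this way would still need review before being used in indexing, since a high-frequency word in a specialized corpus may be a meaningful subject term rather than a stop word.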

Keywords

