Measuring the Information Content of Farsi Scientific Texts Based on Information Theory of Entropy

Document Type : Research Article

Authors

Ferdowsi University of Mashhad

Abstract

Purpose: This study aimed to measure the information load of words in Farsi scientific texts and to determine, based on Shannon entropy, the relationship between certain word properties and the information load of the texts.
Methodology: The study was based on a content analysis of 320 articles published in Iranian scientific-research journals in 2009; the articles were selected through random sampling.
Findings: The entropy analysis indicated that word frequency, word status, and word length are related to the information load of the texts. The findings also showed a significant difference between the information loads of texts from different scientific areas.
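The measure referred to above is Shannon entropy. As a minimal sketch only (not the authors' actual procedure), the word-level entropy of a text can be computed as below, assuming naive whitespace tokenization and base-2 logarithms; the function name and sample sentence are illustrative:

    import math
    from collections import Counter

    def word_entropy(text):
        # Shannon entropy H = -sum p(w) * log2 p(w) over the word distribution of the text
        words = text.split()                      # naive whitespace tokenization (assumption)
        counts = Counter(words)
        total = len(words)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Higher entropy means word frequencies are spread more evenly,
    # i.e. a higher average information load per word token.
    print(word_entropy("the quick brown fox jumps over the lazy dog"))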

Keywords

