Metadata Extraction from Persian Scientific Papers Using CRF Model

Document Type : Research Article

Authors

Imam Khomeini International University

Abstract

INTRODUCTION:
Extracting metadata from scientific papers is costly and time-consuming, and the variety of paper layouts and styles makes the problem more complex. Metadata extraction therefore remains an open research question that can be approached with different algorithms. The purpose of this paper is to present a framework for metadata extraction from Persian scientific papers, based on the conditional random field (CRF) model.
METHODOLOGY:
This is an applied research study that presents a framework for metadata extraction. The framework identifies the header of a paper along with its English and Persian references, and a CRF model is then used to extract metadata from the header and the references. The model can be adapted by defining different features. The proposed method was tested on a set of 100 scientific papers taken from different Iranian journals. In text tagging, the CRF model achieves higher accuracy than other models such as Markov models. It is also a statistical model, and statistical extraction gives better results than rule-based methods on papers with different layouts and styles. The CRF model is therefore a good solution to this problem.
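To illustrate what "defining different features" can mean for a CRF tagger, the following is a minimal sketch of a per-token feature function. The abstract does not list the features actually used, so every feature name here (`has_persian`, `looks_like_year`, and so on) and both function names are hypothetical. The output is a list of feature dictionaries, a format that common CRF toolkits accept.

```python
import re

# Hypothetical feature set: the abstract does not say which features were used.
PERSIAN_CHARS = re.compile(r"[\u0600-\u06FF]")            # Arabic-script block, covers Persian letters
YEAR = re.compile(r"^(1[34]\d{2}|19\d{2}|20\d{2})$")      # Solar Hijri (13xx/14xx) or Gregorian years


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_capitalized": tok[:1].isupper(),
        "has_persian": bool(PERSIAN_CHARS.search(tok)),
        "looks_like_year": bool(YEAR.match(tok)),
        # Position within the sequence, scaled to the range 0..1.
        "relative_position": round(i / max(len(tokens) - 1, 1), 2),
        # Neighbouring tokens give the model context at field borders.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sequence_features(tokens):
    """Feature dictionaries for every token of a header or reference string."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Adding or removing keys in the dictionary is the only change needed to try a different feature set, which is the kind of adjustment the methodology refers to.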
FINDINGS:
The F-measure, calculated per token, was used to evaluate the proposed method. The average F-measure is 96.89%, 93.87%, and 94.75% for header metadata, Persian reference metadata, and English reference metadata, respectively. The results were compared with three similar studies on English papers. For the header author field, the English studies report better results, whereas for the abstract field the results of this Persian study are better. On average, reference metadata extraction is more accurate in the English studies than for the Persian references.
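A token-level F-measure of this kind can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of the paper: the function names are hypothetical, and the unweighted (macro) average over labels is an assumption, since the abstract does not say how the average was taken.

```python
from collections import Counter


def token_f_measure(gold, predicted):
    """Per-label F-measure, counting every token as one decision."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was assigned wrongly
            fn[g] += 1  # gold label was missed

    scores = {}
    for label in set(gold) | set(predicted):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def average_f_measure(scores):
    """Unweighted mean over labels (assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


# Toy example with made-up labels: one AUTHOR token is mislabeled as TITLE.
gold = ["TITLE", "TITLE", "AUTHOR", "AUTHOR", "ABSTRACT", "ABSTRACT"]
pred = ["TITLE", "TITLE", "AUTHOR", "TITLE", "ABSTRACT", "ABSTRACT"]
per_label = token_f_measure(gold, pred)
print(per_label, average_f_measure(per_label))
```

Because each token counts once, fields with many tokens, such as the abstract, are scored over far more decisions than short fields.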
CONCLUSIONS:
The results show that the CRF model performs well for metadata extraction. The most accurately extracted metadata is the abstract, with an F-measure of 99.6%; it also has far more tokens than the other metadata. The institute, with an F-measure of 80.95%, is less accurate than the other metadata. There are two reasons for this lower F-measure. First, this metadata occurs less often in the text corpus than the other metadata. Second, the words used in it are more diverse. In Persian references, city names appear in both location and institution metadata, which causes the two to be confused in some cases. In Persian, more words are shared across different metadata than in English. For example, many Iranian personal names also occur with other meanings in other metadata, which may cause errors. Most extraction errors involve tokens located on the border between two metadata fields. Converting scientific papers from PDF to text is difficult in many cases, and this is one of the limitations of this research. This paper used a sample of 100 scientific articles; increasing the number and variety of test papers can have a positive effect on the results. The CRF tagging algorithm uses a set of textual features, and changing these features can improve the method.

Keywords

