Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that outperforms the two initial architectures on Hebrew. Our study is based on the well-known Hidden Markov Models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transform for casting the segment-level tagger in terms of a standard, word-level HMM implementation.
The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
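The contrast between the two tokenization levels can be illustrated with a small Python sketch. It is not the authors' implementation. The toy lexicon `ANALYSES`, its made-up words and tags, and the helper names `word_level_candidates` and `segment_level_paths` are all assumptions introduced here for illustration, and the `segment/TAG` joined with `+` notation for complex tags is likewise only one possible encoding.

```python
from itertools import product

# Hypothetical analyzer output (invented strings, not real Hebrew):
# each word maps to its candidate analyses, and each analysis is a
# list of (segment, POS tag) pairs.
ANALYSES = {
    "xab": [[("x", "PREP"), ("ab", "NOUN")], [("xab", "VERB")]],
    "cd": [[("cd", "NOUN")], [("c", "CONJ"), ("d", "VERB")]],
}


def word_level_candidates(word):
    """Word-level view: one token per word, with complex candidate tags.

    Each complex tag encodes both the segmentation and the POS tag of
    every segment, e.g. "x/PREP+ab/NOUN".
    """
    return [
        "+".join(f"{seg}/{tag}" for seg, tag in analysis)
        for analysis in ANALYSES[word]
    ]


def segment_level_paths(words):
    """Segment-level view: tokens are segments, tags are plain POS tags.

    The input must cover every alternative segmentation, so each
    combination of per-word analyses is enumerated as one path.
    """
    paths = []
    for choice in product(*(ANALYSES[w] for w in words)):
        segments = [seg for analysis in choice for seg, _ in analysis]
        tags = [tag for analysis in choice for _, tag in analysis]
        paths.append((segments, tags))
    return paths


if __name__ == "__main__":
    print(word_level_candidates("xab"))
    for segments, tags in segment_level_paths(["xab", "cd"]):
        print(segments, tags)
```

In the word-level view the tag set grows because each tag carries a whole analysis. In the segment-level view the tags stay simple, but the tagger has to choose among token sequences of different lengths.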
Added to index: 2009-01-28