Language varieties

Abstract
Part-of-speech tagging, like any supervised statistical NLP task, is more difficult when test sets are very different from training sets, for example when tagging across genres or language varieties. We examined the problem of POS tagging of different varieties of Mandarin Chinese (PRC-Mainland, PRC- Hong Kong, and Taiwan). An analytic study first showed that unknown words were a major source of difficulty in cross-variety tagging. Unknown words in English tend to be proper nouns. By contrast, we found that Mandarin unknown words were mostly common nouns and verbs. We showed these results are caused by the high frequency of morphological compounding in Mandarin; in this sense Mandarin is more like German than English. Based on this analysis, we propose a variety of new morphological unknown-word features for POS tagging, extending earlier work by others on unknown-word tagging in English and German. Our features were implemented in a maximum entropy Markov model. Our system achieves state-of-the-art performance in Mandarin tagging, including improving unknown-word tagging performance on unseen varieties in Chinese Treebank 5.0 from 61% to 80% correct.
Keywords No keywords specified (fix it)
Categories No categories specified
(categorize this paper)
Options
 Save to my reading list
Follow the author(s)
My bibliography
Export citation
Find it on Scholar
Edit this record
Mark as duplicate
Revision history Request removal from index Translate to english
 
Download options
PhilPapers Archive


Upload a copy of this paper     Check publisher's policy on self-archival     Papers currently archived: 11,365
External links
Setup an account with your affiliations in order to access resources via your University's proxy server
Configure custom proxy (use this if your affiliation does not provide a proxy)
Through your library
References found in this work BETA

No references found.

Citations of this work BETA

No citations found.

Similar books and articles
Analytics

Monthly downloads

Sorry, there are not enough data points to plot this chart.

Added to index

2010-12-22

Total downloads

1 ( #445,363 of 1,102,742 )

Recent downloads (6 months)

1 ( #296,833 of 1,102,742 )

How can I increase my downloads?

My notes
Sign in to use this feature


Discussion
Start a new thread
Order:
There  are no threads in this forum
Nothing in this forum yet.