Published by De Gruyter, June 7, 2017

A Novel Word Clustering and Cluster Merging Technique for Named Entity Recognition

  • Rakesh Patra and Sujan Kumar Saha

Abstract

In this paper, we present a novel word clustering technique to capture contextual similarity among words. Related word clustering techniques in the literature rely on word statistics collected from a small, fixed word window. For example, the Brown clustering algorithm is based on bigram statistics of the words. However, in sequential labeling tasks such as named entity recognition (NER), words in the longer context also carry valuable information. To capture this longer context, we propose a new word clustering algorithm that uses parse information of the sentences and a nonfixed word window. The proposed algorithm, named variable window clustering, performs better than Brown clustering in our experiments. Additionally, to use two different clustering techniques simultaneously in a classifier, we propose a cluster merging technique that performs an output-level merging of two sets of clusters. To test the effectiveness of the approaches, we use two different NER data sets, namely, Hindi and BioCreative II Gene Mention Recognition. A baseline NER system is developed using a conditional random fields classifier, and the clusters obtained from the individual techniques as well as from the merged technique are then incorporated to improve the classifier. Experimental results demonstrate that the cluster merging technique is quite promising.
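
To make the contrast between a small fixed window and a longer context concrete, the following is a minimal sketch and not the authors' algorithm: it only counts neighboring words inside a symmetric window of a chosen size. The function name `context_counts`, the `window` parameter, and the toy sentences are assumptions introduced for illustration; the variable window clustering described in the abstract additionally relies on parse information, which this sketch does not model.

```python
from collections import Counter, defaultdict


def context_counts(sentences, window=1):
    """For each word, count the words seen within `window` positions of it.

    Hypothetical helper for illustration only: window=1 keeps only adjacent
    words (roughly the information bigram statistics provide), while larger
    values also collect words from the longer context.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word itself
                    counts[word][tokens[j]] += 1
    return counts


# Toy sentences (assumed data, not from the paper's corpora).
sentences = [
    "the minister visited Delhi on Monday".split(),
    "the minister visited Mumbai on Friday".split(),
]

narrow = context_counts(sentences, window=1)
wide = context_counts(sentences, window=2)

print(sorted(narrow["Delhi"]))  # adjacent words only
print(sorted(wide["Delhi"]))    # also includes words two positions away
```

With `window=1`, the context of "Delhi" contains only its immediate neighbors; with `window=2`, it also picks up "minister" and "Monday", which is the kind of longer-range evidence the abstract argues is useful for NER.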

Classification: 91C20; 68T50; 68T30; 62H30

Received: 2016-06-09
Published Online: 2017-06-07
Published in Print: 2019-01-28

©2019 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 4.5.2024 from https://www.degruyter.com/document/doi/10.1515/jisys-2016-0074/html