Abstract

Named entity recognition (NER) is a fundamental subtask in natural language processing, and its accuracy strongly affects the effectiveness of downstream tasks. To address the insufficient representation of latent Chinese-character features in NER, this paper proposes a multifeature adaptive fusion Chinese named entity recognition (MAF-CNER) model. The model uses a bidirectional long short-term memory (BiLSTM) network to extract stroke and radical features and adopts a weighted concatenation method to fuse the two sets of features adaptively. This method integrates the two feature sets more effectively, thereby improving the model's entity recognition ability. To fully test the entity recognition performance of the model, we compared it with baseline and other mainstream models on the Microsoft Research Asia (MSRA) dataset and the "China People's Daily" dataset from January to June 1998. Experimental results show that this model outperforms the others, with F1 values of 97.01% and 96.78%, respectively.

1. Introduction

Word representation learning has attracted wide attention as a fundamental problem in natural language processing. Unlike traditional one-hot representations, low-dimensional distributed word representations (also called word embeddings) represent words as dense real-valued vectors, which better capture the associations between natural language words. This form of representation is very useful in downstream natural language processing tasks, such as text classification [1], NER [2, 3], relation extraction [4, 5], and sentiment analysis [6, 7]. Therefore, obtaining a better semantic representation of words is crucial.

In recent years, NER models have mainly been based on deep learning, and as deep learning has developed, increasingly remarkable results have been achieved. The main basic framework for English NER is the Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) model [8], which uses word embeddings as the basic unit for predicting labels. English is written with a phonetic alphabet, whereas Chinese characters are ideographic, so these research methods cannot be applied to Chinese directly. Unlike English, Chinese sentences have no explicit separators between words. Therefore, Chinese NER tasks are usually handled by first segmenting sentences with word segmentation tools and then applying a sequence tagging model to the segmented words. This approach leads to poor CNER performance because CNER faces the following difficulties: (1) the quality of sentence segmentation has a great impact on NER performance. For example, "武汉市长江大桥" (Wuhan Yangtze River Bridge) is a single location entity, but a word segmentation tool may split it into "武汉" (Wuhan), "市长" (mayor), and "江大桥" (river's bridge). When these segments are used as input to the NER model, they are recognized as three different named entities. (2) To avoid the problems of word-level embedding, character-level embedding is widely used in NER tasks, but it still has shortcomings. As shown in Figure 1, "人" (people), "八" (eight), and "乂" (yi) are semantically unrelated, yet each has the same stroke sequence: "丿" (left-falling stroke) followed by "㇏" (right-falling stroke). Chinese characters with the same stroke sequence can have completely different semantics, so a stroke sequence cannot uniquely identify a Chinese character. Using the radical feature alone runs into the same problem. To solve this, we can introduce another internal characteristic of Chinese characters, the radical: the radicals of "人", "八", and "乂" are "人", "八", and "丿", respectively. Combining the stroke and radical features can distinguish Chinese characters well.

Integrating the internal characteristics of Chinese characters is effective for learning Chinese word embeddings [9]. For example, Yin et al. used Convolutional Neural Networks (CNNs) to extract radical features, aiming to capture the intrinsic correlations among characters; their model achieved good performance in Chinese clinical NER [10]. Chinese characters have rich internal structural features, and learning to use these features well is very important for improving the quality of Chinese character embeddings. How to better combine character-level features with the internal characteristics of Chinese characters deserves further study. This article designs a multifeature adaptive fusion (MAF) method to fuse the stroke and radical features of Chinese characters, adaptively computing the weights with which the two features are fused. The main contributions of this article can be summarized as follows:

(1) This article integrates character, stroke, and radical features into the BiLSTM-CRF model to fully represent the semantic information of Chinese characters.

(2) To achieve a better and more balanced fusion of the two sets of features, this article adopts a weighted-concatenation adaptive feature fusion method.

(3) Evaluation results show that the model achieves good performance on both the MSRA dataset and the 1998 China People's Daily dataset.

The remainder of this article is structured as follows. Section 2 introduces the related work of CNER. Section 3 gives a detailed description of the MAF-CNER model. Section 4 presents extensive experiments to verify the effectiveness of our proposal, and Section 5 summarizes this work.

2. Related Work

Traditional solutions to the NER problem mainly include three kinds of methods: rule-based, statistics-based, and dictionary-based [11]. Methods based on rules and dictionaries require professional linguists to write rules by hand, which is time-consuming and ports poorly across domains. Statistical methods for NER mainly use the Conditional Random Field (CRF) and Hidden Markov Model (HMM) [12, 13]. Although their accuracy improves on rule- and dictionary-based methods, they still have disadvantages such as long training times.

With the continuous development of deep learning, researchers began to apply it to NER tasks. Compared with traditional models, neural network models can learn deeper semantic feature information with almost no need for feature engineering [14] or domain knowledge. These models further improve the accuracy of entity recognition; the BiLSTM-CRF model [15, 16] in particular significantly improves the performance of NER tasks.

The standard model for solving NER problems in the English domain is the BiLSTM-CRF model proposed by Huang et al. [17], which is robust and less dependent on word embeddings. Based on this structure, Lample et al. proposed using a BiLSTM to extract word representations from character-level embeddings. Cho et al. proposed a deep learning NER model that effectively represents biomedical word tokens through a combinatorial feature embedding, enhanced by integrating two different character-level representations extracted from a CNN and a BiLSTM [18]. In the Chinese field, CNER is more challenging [19]. Wang et al. proposed a CNN model based on a gating mechanism (GCNN) [20]. Cao et al. used Chinese character strokes as features and proposed the stroke n-gram model, which not only mines the feature information of Chinese character strokes but also uses the semantic information of Chinese characters more effectively to train word vectors [21]. Cao et al. proposed a novel adversarial transfer learning framework to make full use of the boundary information shared across tasks and to keep task-specific features of Chinese word segmentation from interfering [22]. Xu et al. proposed a simple and effective neural framework, ME-CNER (Multiple Embeddings for Chinese Named Entity Recognition), which embeds rich semantic information at multiple levels, from radicals and characters up to words [23]. Wu et al. proposed a radical-based CNER model, RCBC (R-CNN-BiLSTM-CRF), which uses CNNs to automatically extract the semantics of the radicals of Chinese characters and combines the word vectors and radical vectors into a joint vector; this method reduces the semantic deviation of radical features and captures semantic information more accurately [24]. Ye et al. proposed a CNER model based on character-word vector fusion, which reduces the dependence on the accuracy of word segmentation algorithms and effectively utilizes the semantic features of words [25]. To address the ambiguity of Chinese words and the lack of word boundaries, Wu et al. proposed a novel fine-grained character-level representation to capture the semantic information of Chinese characters [26]. Although the above methods have achieved good results, none of them explores the internal characteristics of Chinese characters in depth, and the fusion of multiple features deserves deeper study.

3. MAF-CNER Model

This section introduces the network layer structure of the multifeature adaptive fusion Chinese named entity recognition (MAF-CNER) model, as shown in Figure 2. The model is divided into three layers: the character, stroke, and radical multifeature vector fusion layer; the BiLSTM layer; and the CRF layer. The radical and stroke feature representations are computed by BiLSTM networks, merged using the weighted concatenation method, and concatenated with the character vector to form the final input vector. The BiLSTM layer then extracts the context features of the input vector. The CRF layer takes the output of the BiLSTM layer, decodes it, and obtains the best tag sequence. We introduce the components of the MAF-based CNER model from bottom to top, as follows.

3.1. Character, Stroke, and Radical Multifeature Vector Fusion Layer

For a given sentence sequence $s = \{c_1, c_2, \ldots, c_n\}$, the embedding vector of each Chinese character $c_i$ is composed of its character feature $x_i^{c}$, its radical feature $x_i^{r}$, and its stroke feature $x_i^{s}$. As shown in Figure 3, the embedding vector of each character can be expressed as follows:

$$x_i = \left[x_i^{c};\ x_i^{sr}\right], \tag{1}$$

where $x_i^{sr}$ denotes the adaptive fusion of the stroke feature $x_i^{s}$ and the radical feature $x_i^{r}$ defined in Section 3.3.3, and $[\cdot\,;\cdot]$ denotes vector concatenation.

3.2. Character Embedding

Character-level embedding has been widely used in natural language processing. Research shows that pretraining character embeddings on a domain-specific corpus can improve system performance. For example, adding character-level features in neural machine translation [27, 28] can improve translation performance, and text classification [29, 30] and NER also use character-level representations. Pretrained character embeddings therefore outperform randomly initialized ones. This article uses the Chinese Wikipedia corpus of May 2020 to pretrain Chinese character embeddings with Word2Vec. After preprocessing, about 171M of training corpus is finally obtained. The pretraining of character embeddings is implemented with the Python version of Word2Vec in Gensim, and the dimension of the feature vector is set to 100.
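For illustration, a minimal sketch of this pretraining step is shown below. The corpus file name, the per-character tokenization, and the window and min_count settings are assumptions; only the tool (Gensim Word2Vec) and the 100-dimensional vector size come from the text.

```python
# Minimal sketch of character-embedding pretraining with Gensim Word2Vec.
# "zh_wiki_preprocessed.txt" is a hypothetical preprocessed Wikipedia dump
# with one sentence per line; treating each Chinese character as a token
# yields character-level (rather than word-level) embeddings.
from gensim.models import Word2Vec

with open("zh_wiki_preprocessed.txt", encoding="utf-8") as f:
    sentences = [list(line.strip()) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension used in this article
    window=5,         # assumed context window
    min_count=1,      # assumed; keeps rare characters
    workers=4,
)
model.wv.save("char_embeddings.kv")  # later query with model.wv["中"]
```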

3.3. Adaptive Fusion Representation of Strokes and Radical Features
3.3.1. Radical Features

A Chinese character is a kind of pictograph, and the radical is the first stroke or component of a Chinese character. One of the most notable features of Chinese characters is that they carry a great deal of semantic information at the radical level, and radicals strongly influence the semantics of the characters that contain them. For example, in "胖" (fat), "胸" (chest), and "肺" (lung), the shared radical "月" (moon) is a simplified form of "肉" (flesh), which stands for meat, indicating that these characters are related to organs. A total of 228 radicals, such as "鹿" (deer), "卤" (halogen), and "丶" (dot), are numbered from 1 to 228. Research in traditional models, however, has mainly focused on semantics at the phrase level.

This article uses the BiLSTM network to extract the semantic information of the radicals corresponding to Chinese characters. Figure 4 shows the overall structure of the model in detail. The expression is as follows:

$$x_i^{r} = \operatorname{BiLSTM}\left(e\left(r_i\right)\right), \tag{2}$$

where $e(r_i)$ is the embedding of the radical $r_i$ of the $i$-th Chinese character. In formula (2), $x_i^{r}$ is the hidden layer vector obtained by training the BiLSTM network.

3.3.2. Stroke Characteristics

A stroke is an uninterrupted dot or line of a particular shape used to compose Chinese characters, such as the horizontal ("一"), vertical ("丨"), left-falling ("丿"), and dot ("丶") strokes. It is the smallest continuous writing unit of a Chinese character. As shown in Table 1, we divide strokes into five types, numbered 1 to 5.

The Chinese character writing system prescribes a stroke order for each Chinese character. With this information, we can decompose Chinese characters into strokes in a specific order, and this sequence information can be exploited when learning the internal semantics of Chinese characters. Therefore, this article uses the BiLSTM network to extract the contextual semantic information of Chinese character strokes; Figure 4 shows the model structure. This method can learn more of the graphic features of Chinese characters. The expression is as follows:

$$x_i^{s} = \operatorname{BiLSTM}\left(s_{i1}, s_{i2}, \ldots, s_{ik}\right), \tag{3}$$

where $s_{ij}$ is the $j$-th stroke feature vector of the $i$-th Chinese character.
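To make the two feature extractors concrete, the following is a minimal PyTorch sketch of a BiLSTM encoder that could serve for either radicals or strokes. The class and parameter names, layer sizes, and the use of the final hidden states as the feature vector are illustrative assumptions; the vocabulary sizes (228 radicals, 5 stroke types) come from the text.

```python
# Sketch of a BiLSTM encoder for character-internal features
# (radical or stroke sequences); names and sizes are illustrative.
import torch
import torch.nn as nn

class FeatureBiLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, seq_ids):
        # seq_ids: (batch, seq_len) radical or stroke-type IDs per character
        out, (h_n, _) = self.bilstm(self.emb(seq_ids))
        # Concatenate the final forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)

# 228 radicals (IDs 1-228, 0 reserved for padding) and 5 stroke types (1-5).
radical_encoder = FeatureBiLSTM(vocab_size=229)
stroke_encoder = FeatureBiLSTM(vocab_size=6)
```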

3.3.3. Adaptive Feature Fusion

As shown in Figure 4, this article takes the stroke feature as the main feature, computes its similarity with the character vector obtained from Word2Vec training, and determines its weight $m$ according to formula (4):

$$m = \frac{x_i^{c} \cdot x_i^{s}}{\left\|x_i^{c}\right\| \left\|x_i^{s}\right\|}, \tag{4}$$

where $x_i^{c}$ is the character vector and $x_i^{s}$ is the stroke vector.

The radical feature serves as an auxiliary feature. Its importance is calculated according to formulas (5) and (6), which determine its weight $n$, and the weighted concatenation method is then used to fuse the two sets of features. This approach not only learns more of the graphic features of Chinese characters but also balances the combination of the two features.

$$u_i = \tanh\left(W x_i^{r} + b\right), \tag{5}$$

$$n = \operatorname{sigmoid}\left(u_i\right). \tag{6}$$

In formula (5), $W$ and $b$ are trainable parameters.

The adaptively fused features are expressed as follows:

$$x_i^{sr} = \left[m \cdot x_i^{s};\ n \cdot x_i^{r}\right]. \tag{7}$$
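A compact PyTorch sketch of this fusion step, under the reconstructed formulas (4)-(7), might look as follows. It assumes all three feature vectors share a common dimension, and the sigmoid-over-tanh normalization of the radical weight mirrors the reconstruction above rather than a confirmed detail of the paper.

```python
# Sketch of adaptive stroke/radical fusion per formulas (4)-(7);
# assumes x_char, x_stroke, and x_radical all have dimension feat_dim.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.W = nn.Linear(feat_dim, 1)  # trainable W and b of formula (5)

    def forward(self, x_char, x_stroke, x_radical):
        # Formula (4): stroke weight m = cosine similarity between the
        # character vector and the stroke vector.
        m = F.cosine_similarity(x_char, x_stroke, dim=-1, eps=1e-8)
        # Formulas (5)-(6): radical weight n from a learned scoring of the
        # radical feature itself (sigmoid normalization is an assumption).
        n = torch.sigmoid(torch.tanh(self.W(x_radical))).squeeze(-1)
        # Formula (7): weighted concatenation of the two feature sets.
        fused = torch.cat([m.unsqueeze(-1) * x_stroke,
                           n.unsqueeze(-1) * x_radical], dim=-1)
        # Formula (1): character embedding concatenated with the fusion.
        return torch.cat([x_char, fused], dim=-1)
```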

3.4. BiLSTM Layer

Long Short-Term Memory (LSTM) is a variant of the Recurrent Neural Network (RNN) that has been widely used in many natural language processing (NLP) tasks, such as NER, text classification, and sentiment analysis. It introduces a cell state and uses input, forget, and output gates to maintain and control information, which effectively overcomes the gradient explosion and gradient vanishing problems that RNNs suffer from over long-distance dependencies. The mathematical expressions of the LSTM model are as follows:

$$i_t = \sigma\left(W_i\left[h_{t-1}; x_t\right] + b_i\right),$$
$$f_t = \sigma\left(W_f\left[h_{t-1}; x_t\right] + b_f\right),$$
$$o_t = \sigma\left(W_o\left[h_{t-1}; x_t\right] + b_o\right),$$
$$\tilde{c}_t = \tanh\left(W_c\left[h_{t-1}; x_t\right] + b_c\right),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$h_t = o_t \odot \tanh\left(c_t\right),$$

where $\sigma$ represents the sigmoid activation function and $\tanh$ the hyperbolic tangent; $x_t$ represents the unit input at time $t$; $i_t$, $f_t$, and $o_t$ represent the input gate, forget gate, and output gate at time $t$; $W$ and $b$ represent the weights and biases of these gates; $\tilde{c}_t$ represents the candidate state of the current input; $c_t$ represents the updated cell state at time $t$; and $h_t$ is the output at time $t$.

To use character context information in both directions, the model in this article uses a BiLSTM, a combination of a forward LSTM and a reverse LSTM, to obtain the context vector of each character. For a given sentence $x = (x_1, x_2, \ldots, x_n)$, we use $\overrightarrow{h_t}$ to represent the hidden state of the forward LSTM at time $t$, whereas $\overleftarrow{h_t}$ represents that of the reverse LSTM at time $t$. Concatenating the corresponding forward and reverse hidden states yields the final context vector $h_t = \left[\overrightarrow{h_t}; \overleftarrow{h_t}\right]$.
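In PyTorch this per-character concatenation comes for free from a bidirectional LSTM, as the tiny sketch below shows; the hidden size is an arbitrary choice, while the input width of 300 matches the embedding dimension reported in Section 4.2.

```python
# Per-character context vectors from a bidirectional LSTM: PyTorch
# returns the forward/backward concatenation at every time step.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=300, hidden_size=128,
                 batch_first=True, bidirectional=True)
x = torch.randn(1, 80, 300)  # (batch, max_length, embedding_dim)
h, _ = bilstm(x)             # h: (1, 80, 256); h[:, t] = [h_fwd_t; h_bwd_t]
```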

3.5. CRF Layer

Compared with the HMM, the CRF is not bound by the HMM's strict independence assumptions and can effectively use both the internal information of the sequence and external observation information, avoiding the label bias problem by directly modeling the conditional probability of the label sequence discriminatively. A CRF can also capture dependencies between labels: for example, an "I-LOC" tag cannot follow "B-PER" [20]. In CNER, the input of the CRF is the context feature vector learned by the BiLSTM layer. For an input sentence $x = (x_1, x_2, \ldots, x_n)$,

let $P_{i, y_i}$ denote the probability score of the $y_i$-th label of the $i$-th Chinese character in the sentence. For a prediction sequence $y = (y_1, y_2, \ldots, y_n)$, the CRF score can be defined as follows:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},$$

where $A$ is the transition matrix and $A_{y_i, y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$; $y_0$ and $y_{n+1}$ represent the start and end tags, respectively. Finally, we use the softmax function to calculate the probability of the sequence $y$ as follows:

$$p(y \mid x) = \frac{e^{s(x, y)}}{\sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}},$$

where $Y_x$ is the set of all possible label sequences for $x$.

During training, we maximize the log probability of the correct label sequence:

$$\log p(y \mid x) = s(x, y) - \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}.$$

In the decoding stage, we predict the output sequence with the maximum score:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y}).$$

In the prediction stage, the Viterbi dynamic programming algorithm is used to find this optimal sequence.
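To make the decoding step concrete, here is a minimal, unbatched Viterbi sketch over the emission scores $P$ and transition matrix $A$ defined above; a production implementation would be batched, and the function name and tensor layout are illustrative.

```python
# Minimal Viterbi decoding sketch for the CRF layer.
import torch

def viterbi_decode(emissions, transitions):
    # emissions: (seq_len, num_tags) scores P from the BiLSTM layer
    # transitions: (num_tags, num_tags), transitions[i][j] = score of i -> j
    seq_len, num_tags = emissions.shape
    score = emissions[0].clone()  # best score ending in each tag so far
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j] = score[i] + transitions[i][j] + emissions[t][j]
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)  # maximize over previous tag i
        backpointers.append(best_prev)
    # Follow the backpointers from the best final tag.
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```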

4. Experiments and Results

4.1. Experimental Data and Evaluation Indicators

To evaluate the proposed model on the task of CNER, we conducted experiments on two widely used datasets: the MSRA dataset and the "China People's Daily" dataset from January to June 1998. Table 2 shows statistics of the datasets used in this article.

4.1.1. MSRA

MSRA is a general dataset for CNER. It contains three types of named entities: PER (person), LOC (location), and ORG (organization). The training set contains 46364 sentences, and the test set contains 4365 sentences. This article uses the ternary tag set {B, I, O} for labeling, where B marks the first character of an entity, I marks the remaining characters of an entity, and O marks non-entity characters. For example, a three-character location entity is tagged B I I, while the surrounding non-entity characters are tagged O.

4.1.2. China People’s Daily

The China People's Daily corpus, covering January to June 1998, was released by the Institute of Computational Linguistics of Peking University. Its entity categories are PER (person), LOC (location), and ORG (organization), also labeled with the ternary tag set {B, I, O}. This article uses the data from January to May 1998 as the training and validation sets, with the validation set being 1/5 of the January-to-May data, and the data from June 1998 as the test set.

To fully evaluate the performance of the model, we use Precision (P), Recall (R), and the harmonic mean F1-score (F1) as the evaluation criteria, defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}.$$

Here, TP (True Positive) is the number of entities correctly identified by the model, FP (False Positive) is the number of predicted entities that are incorrect, and FN (False Negative) is the number of true entities that the model fails to identify.
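A direct, entity-level rendering of these definitions in Python might look as follows; the function name and the set-based formulation are illustrative.

```python
# Entity-level precision, recall, and F1 from predicted and gold entity
# sets, following the definitions above.
def prf1(pred_entities, gold_entities):
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)   # correctly predicted entities
    fp = len(pred - gold)   # predicted entities not in the gold set
    fn = len(gold - pred)   # gold entities the model missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```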

4.2. Model Building and Parameter Setting

The model in this article is built with PyTorch, which was released by the Facebook Artificial Intelligence Research Institute (FAIR) in January 2017 based on Torch and is widely used in applications such as NLP. The experimental parameters are set as follows: the embedding dimension (embedding_dim) is 300, the input length (max_length) is 80, and the training batch_size is 100 for the China People's Daily dataset and 128 for the MSRA dataset. The learning rate is set to 0.001. To prevent overfitting during training, a weight decay factor (weight_decay) is applied, and dropout with a rate of 0.5 is used.
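As a rough sketch of this configuration: the placeholder network and the weight_decay value below are assumptions (the exact weight_decay value is not legible in the source text), while the learning rate, dropout rate, batch sizes, and dimensions come from the paragraph above.

```python
# Sketch of the training configuration described above.
import torch
import torch.nn as nn

model = nn.Linear(10, 3)  # placeholder standing in for the MAF-CNER network

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,           # learning rate from the paper
    weight_decay=1e-5,  # placeholder: the paper's exact value is elided
)
dropout = nn.Dropout(p=0.5)  # dropout rate from the paper

batch_size = {"peoples_daily": 100, "msra": 128}  # per-dataset batch sizes
max_length = 80      # input sequence length
embedding_dim = 300  # final embedding dimension
```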

4.3. Experimental Results

To evaluate the performance of the proposed model more objectively on the MSRA and "China People's Daily" datasets, we compare it with the LSTM, BiLSTM, and BiLSTM-CRF models. The experimental results are shown in Table 3.

In Table 3, the comparison between LSTM and BiLSTM shows that the latter performs better, verifying that the BiLSTM network can better capture the context information of serialized text and has stronger learning ability than the LSTM. The comparison between BiLSTM and BiLSTM-CRF shows that, after adding the CRF module, BiLSTM-CRF outperforms BiLSTM on all metrics, mainly because the CRF considers global label information in the sequence during decoding, which improves model performance. Our model introduces stroke and radical features on top of character-level embedding, and its test results on the two datasets achieve the best performance.

To verify the effectiveness of the proposed method, it is compared with other mainstream NER methods; the specific results are shown in Tables 4 and 5. In Table 4, Chen et al. used a CRF based on character features, with an F1 value of 86.20% [31]. Zhou et al. used a multistage model: a character-level CRF to segment the sequence, followed by a word-level CRF layer to identify named entities, reaching an F1 value of 86.51% on the MSRA dataset [32]. Zhou et al. treated CNER as a joint recognition and classification task based on a global linear model [33], using the rich manual features proposed in the literature [41] to greatly improve CNER performance. The F1 value of another BiLSTM-CRF neural network model, proposed by Dong et al., was close to 90.95%; this model used both character-level and radical-level representations in its input [34]. Zhang et al. used a lattice LSTM model for CNER, which encodes the input character sequence together with all potential words matching a dictionary; its F1 value reached 93.18%, although the authors trained the lattice LSTM without using a development set [35]. Zhao et al. used a pretrained language model to encode the input sequence as a contextual representation and designed a new model combining neural networks with BERT, reaching an F1 value of 95.28% [36]. Johnson proposed a comprehensive embedding that takes characters, words, and positions into account, with an effective structure that captures useful information, reaching an F1 value of 92.99% on the MSRA dataset. With the model in this article, the F1 value reaches 97.01%, the best performance among these models.

Table 5 shows model performance on the China People's Daily dataset. Collobert et al. used a feedforward neural network combined with preprocessing, affix, and capitalization features and achieved an F1 of 88.50% [38]; Lample et al. fed character-level word vectors into the BiLSTM-CRF model and achieved an F1 value of 90.08% [19]; Chiu et al. combined BiLSTM with a CNN model and achieved an advanced result of 91.49% [39]. Shen et al. showed that combining deep learning with active learning can reduce the amount of labeled training data; although active learning improves sample efficiency, it can be computationally expensive because of iterative retraining, so to speed it up they introduced a lightweight architecture, the CNN-CNN-LSTM model, consisting of convolutional character and word encoders and an LSTM tag decoder [40]. This article uses BiLSTM-CRF as the basic model and introduces two kinds of internal semantic information of Chinese characters, strokes and radicals; the model's F1 increased to 96.78%.

5. Conclusion

In view of the insufficient representation of the potential features of Chinese characters, this article uses BiLSTM networks to learn the internal stroke and radical semantics of Chinese characters and combines them with the BiLSTM-CRF model to construct a CNER model with adaptive multifeature fusion embedding. The model was assessed on the MSRA corpus and the "China People's Daily" corpus from January to June 1998 and, compared with other mainstream methods, achieves the best results on both corpora. The biggest advantage of this model is that the weighted concatenation method adaptively fuses two kinds of semantic information in Chinese characters, whereas previous research stopped at word-level embedding or used only one kind of internal semantic characteristic of Chinese characters, leaving the embedding layer insufficiently representative, reducing model performance, and preventing named entities from being correctly identified. Combining the two internal features represents Chinese character features more fully, avoids the problem that a single feature cannot correctly distinguish Chinese characters, and, through weighting, balances the proportions of the two kinds of semantic information to achieve the best combination effect.

Data Availability

The datasets used in this article can be downloaded from https://github.com/zhooufeng/Data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The research was supported by the National Natural Science Foundation of China under Grant Nos. 61572225 and 61472049, the Foundation of Jilin Provincial Education Department under Grant No. JJKH20190724KJ, the Jilin Province Science & Technology Department Foundation under Grant Nos. 20190302071GX and 20200201164JC, and the Development and Reform Commission Foundation of Jilin Province under Grant No. 2019C053-11.