1. Introduction
Given unprocessed text, the named entity recognition (NER) task seeks to identify and categorize named entities. NER plays an essential role in downstream natural language processing (NLP) tasks, including relation extraction [1], question-answering systems [2], and entity linking [3].
Chinese NER usually adopts character-level annotation strategies to identify named entities [4]. Several studies have shown that character-based NER avoids errors introduced at the word-segmentation stage [5,6]. However, lexical boundaries often coincide with entity boundaries; thus, the lack of boundary information from a lexicon may cause wrong entities to be extracted. Consider, for instance, “南京市长江大桥 (Nanjing Yangtze River Bridge)”: without lexical knowledge, incorrect spans such as “南京市长 (Mayor of Nanjing)” and “江大桥 (Jiang Daqiao)” may be extracted. Therefore, recent research has focused on improving NER performance by better integrating lexical information into character representations.
To our knowledge, there are two primary methodologies for integrating character and lexical information. The first is the dynamic framework method, which designs dedicated structural support for lexical typing, such as Lattice-LSTM [7], LR-CNN [8], and FLAT [9]. Lattice-LSTM extends the commonly used character-based long short-term memory (LSTM) network to encode character information in sentences while fusing potential word information. The LR-CNN model employs convolutional neural networks (CNNs) to encode both character features and candidate word features, and uses attention mechanisms to integrate the information from characters and words. However, both RNNs and CNNs have limitations in modeling long-range dependencies [10]. FLAT overcomes this limitation by designing an ingenious positional encoding that fuses the lattice structure on top of the Transformer [10]. As a result, FLAT allows characters to interact directly with all of their matched words, independent of long-range dependencies. Despite this progress, the above methods still require modifying the specific structure of the neural network, which limits their broader application. Another approach is to construct adaptive embeddings based on lexical information, i.e., to embed lexical knowledge at the encoding stage. WC-LSTM uses four encoding strategies to statically encode the Lattice-LSTM input [11]. Although WC-LSTM follows the adaptive embedding paradigm, it suffers from information loss. To incorporate contextual information into the original vector of each character, Luo first filters the set of candidate entities for a given character and then constructs a character–entity relationship graph of characters and candidate entities [12]. The character representations in the character–entity adjacency matrix are updated using graph attention networks (GAT), yielding a character representation that incorporates the semantic information of contextual entities. To better utilize lexical sources, SoftLexicon directly maps the characters of matched words to four positions, begin, middle, end, and single, and then uses a static weighting method based on word frequency to weight the words in each lexical set [13]. SoftLexicon has been shown experimentally to address slow inference and the underutilization of matched words, compensating for the shortcomings of lattice-based models [14]. The distinctive feature of this approach is that it does not require complex sequence modeling architectures and can therefore be applied to other sequence labeling frameworks.
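As a rough, stdlib-only sketch of the SoftLexicon idea described above (illustrative only, not the original implementation; the lexicon entries and frequencies below are invented), each character collects its matched lexicon words into Begin/Middle/End/Single sets, which are then weighted statically by word frequency:

```python
# Hypothetical sketch of SoftLexicon-style feature construction: for each
# character, collect matched lexicon words into B/M/E/S sets, then weight
# each set by corpus frequency. Lexicon and frequencies are illustrative.

def build_bmes_sets(sentence, lexicon):
    """lexicon: dict mapping word -> corpus frequency."""
    n = len(sentence)
    sets = [{"B": [], "M": [], "E": [], "S": []} for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            word = sentence[i:j + 1]
            if word not in lexicon:
                continue
            if i == j:
                sets[i]["S"].append(word)   # single-character match
            else:
                sets[i]["B"].append(word)   # word begins here
                for k in range(i + 1, j):
                    sets[k]["M"].append(word)  # word covers middle positions
                sets[j]["E"].append(word)   # word ends here
    return sets

def weight_set(words, lexicon):
    """Static frequency weighting: each word's weight is freq / total freq."""
    total = sum(lexicon[w] for w in words)
    return {w: lexicon[w] / total for w in words} if total else {}

lexicon = {"南京": 30, "南京市": 20, "市长": 25, "长江": 40, "长江大桥": 10, "大桥": 15}
sentence = "南京市长江大桥"
sets = build_bmes_sets(sentence, lexicon)
# Character "长" (index 3) begins "长江" and "长江大桥", and ends "市长".
print(sets[3]["B"])  # ['长江', '长江大桥']
print(sets[3]["E"])  # ['市长']
```

Note how the ambiguous character “长” simultaneously ends “市长 (Mayor)” and begins “长江 (Yangtze River)”; keeping all matches in positional sets defers the disambiguation to the weighted fusion rather than to an error-prone segmentation step.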
In addition, some studies have achieved good results without an external lexicon. Gu found that most entity types exhibit strong naming regularity and designed a Regularity-Inspired reCOgnition Network (RICON) to explore the internal compositional information of entities [15]. The model uses a regularity-aware module to capture the internal regularity of each span and a regularity-agnostic module to mitigate excessive focus on span regularity. RICON achieved state-of-the-art performance on four datasets. Liu used the BERT pretrained language model to replace traditional static word embeddings [16,17]. Generating semantic vectors dynamically from context improved the word-embedding representation, allowing entities to be extracted more accurately and efficiently than with traditional NER algorithms; it also achieved good results on NER in the history and culture domain. To reduce the dependence on data annotation, Chen developed a semisupervised model called MAUIL [18]. Compared with other models, MAUIL integrates multiple levels of attribute embedding, such as character-level and word-level features, which enhances the high-level semantic features of the text and substantially improves the reliability of tasks such as named entity recognition. Furthermore, Li proposed W2NER, which handles three types of NER tasks, flat entities, overlapping entities, and discontinuous entities, in a unified manner [19]. The NER task is reformulated as word–word relation classification, and the model effectively captures the adjacency relations between entity words through next-neighboring-word (NNW) and tail-head-word (THW-*) relations. W2NER has driven unified NER to state-of-the-art performance. These new frameworks bring fresh ideas to Chinese NER.
According to our findings, most existing studies focus on entity-discovery methods that concentrate on detecting entity boundaries and only consider lexicon words that match the entity characters, ignoring the interaction between entity characters and their neighboring matched characters. Fusing lexical information improves the representation of Chinese characters, which is necessary for Chinese NER. However, information about the entity boundary region is also essential for entity detection, and existing lexicon-based methods pay little attention to this region. Our proposed boundary region consists of the adjacent zones in front of and behind the entity boundary, as shown in Figure 1. It is a boundary region of size K, which we call Zone-K.
On the one hand, the lexical semantics of Zone-K helps to improve the understanding of entities and thus to determine their categories. For example, Figure 2 shows that although “高雄 (Kaohsiung)” can be detected in both sentences 1 and 2, it is challenging to determine its category as “PER” in sentence 1 and “LOC” in sentence 2, because Chinese contains many polysemous words. Thus, even if the boundary of an entity is detected correctly, determining its category remains a challenge. In this case, we propose considering the semantics of the Zone-K characters and their matched lexicon words. In sentence 1, the category of “高雄 (Kaohsiung)” can be identified as “PER” from “演员 (performer)” and “饰演 (play)”, whereas in sentence 2 it can be identified as “LOC” from “在 (in)” and “住 (live)”.
On the other hand, there is a significant semantic change between the characters in Zone-K and the boundary characters of the entity. For example, in Figure 3, for the sentence “我曾在高雄住过几天 (I stayed in Kaohsiung for a few days)”, the two character sequences “在 (in)–高 (gao)” and “雄 (xiong)–住 (live)” are low-probability co-occurrence sequences; we therefore regard them as two semantic violators, meaning that the semantic distance between “在 (in)” and “高 (gao)”, and between “雄 (xiong)” and “住 (live)”, is quite large. Accordingly, we treat Zone-K as a semantic mutation zone, which is analogous to the contour of an image and reflects the local feature discontinuity of the text. Semantic changes in Zone-K can help determine the boundary of the entity “高雄 (Kaohsiung)”.
In summary, we propose to use the Zone-K information in two ways.
The first is to fuse the lexical knowledge of Zone-K to help determine entity categories. For this purpose, we propose using graph attention networks to capture the connections between characters and their neighboring matched words. For example, based on the semantics of the adjacent contextual match “饰演 (play)” for “雄 (xiong)”, “高雄 (Kaohsiung)” can be inferred to be tagged as “PER”.
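The attention step can be illustrated with a minimal, single-head sketch (a simplified stand-in, not the paper's exact GAT: the learned scorer a^T[Wh‖Wh'] is replaced by a dot product, and the toy embeddings are invented): a character node attends over the lexicon words matched in its neighborhood and aggregates them into a lexicon-aware representation.

```python
import math

# Minimal single-head graph-attention sketch (illustrative, not the paper's
# exact GAT layer): a character node attends over its neighboring matched
# words, producing attention weights and a fused representation.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gat_aggregate(char_vec, neighbor_vecs, leaky_slope=0.2):
    """Score each neighbor (dot product standing in for the learned scorer),
    apply LeakyReLU, softmax-normalize, then take a weighted neighbor sum."""
    scores = []
    for nv in neighbor_vecs:
        s = dot(char_vec, nv)
        scores.append(s if s > 0 else leaky_slope * s)  # LeakyReLU
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]            # stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]
    fused = [sum(a * nv[d] for a, nv in zip(alphas, neighbor_vecs))
             for d in range(len(char_vec))]
    return alphas, fused

# Toy embeddings: character "雄" attends to its matched context words.
char_xiong = [1.0, 0.0]
neighbors = {"饰演(play)": [0.9, 0.1], "住(live)": [0.1, 0.9]}
alphas, fused = gat_aggregate(char_xiong, list(neighbors.values()))
# "饰演(play)" receives the larger attention weight, nudging "高雄" toward PER.
print(dict(zip(neighbors, alphas)))
```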
The second is to introduce the semantic transformation of Zone-K to help determine boundaries. This is similar to contour detection in images, where fusing local information reveals change boundaries. For this purpose, we introduce a CNN that uses sliding windows to fuse short-sequence information and thereby perceive local sequence features of the text. Furthermore, Chiu and Nichols proposed combining LSTM and CNN networks to learn character- and word-level information for English NER [20]. This inspired us to design a model that combines short-sequence CNN and LSTM encoding: the short-sequence CNN extracts local features of the text, yielding a local contextual representation, while the LSTM implicitly encodes the character sequence to obtain a global contextual representation; the local and global representations are then used together for NER.
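The local/global division of labor can be sketched schematically with stdlib-only stand-ins (illustrative assumptions: a sliding window plays the role of the short-sequence CNN's receptive field, and a running mean stands in for the LSTM's recurrent summary; a real model uses learned convolution filters and LSTM cells):

```python
# Schematic stand-in for the local/global encoder. The sliding window mimics
# the short-sequence CNN's local receptive field; the running mean is a crude
# proxy for an LSTM hidden state. Both are illustrative, not the real model.

def local_windows(seq, k):
    """Pad-and-slide: for each position, the window of k chars on each side."""
    pad = ["<pad>"] * k
    padded = pad + list(seq) + pad
    return [padded[i:i + 2 * k + 1] for i in range(len(seq))]

def global_states(vecs):
    """Running mean over the prefix, standing in for a recurrent summary."""
    states, acc = [], 0.0
    for t, v in enumerate(vecs, start=1):
        acc += v
        states.append(acc / t)
    return states

sentence = "我曾在高雄住过几天"
windows = local_windows(sentence, k=1)  # Zone-1 local context per character
print(windows[3])  # window around "高": ['在', '高', '雄']
```

Each character thus carries both a window view, in which low-probability pairs such as “在–高” surface as local discontinuities, and a whole-sequence view from the recurrent summary; the real encoder concatenates the two representations before tag decoding.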
Compared with previous approaches, ours bridges the gap in the following respects. First, the model adopts the feature representation, context encoder, and tag decoder architecture, so it can be transferred to other networks and combined with BERT [17]. Second, we introduce lexical information at the character representation layer based on the SoftLexicon method, which is simple and direct. A graph neural network directly captures the lexical semantic information of entity neighborhoods without constructing dependency parse trees with external NLP tools, which avoids error propagation, compensates for information loss, and improves the performance of Chinese NER. Finally, the sequence encoding layer of the model effectively balances the acquisition of local and global information and enhances the recognition of entity boundaries, while adding GAT simply and effectively improves the prediction accuracy of entity types.
The contributions of the present study can be summarized as follows:
We constructed a Chinese NER method with enhanced local information perception. The method directly uses local lexical information and captures the semantic relationships between entity characters and matched lexicon entries through graph attention networks, without building dependency parse trees with external NLP tools. This avoids the resulting error propagation and thus effectively improves NER performance.
We used a modified short-sequence CNN to fuse local features, encoding shorter subsequence features with an additional sliding-window module, and combined it with an LSTM to obtain a global representation of the sequence. This compensates for the shortcomings of existing sequence encoders in extracting local and global features.
Our experiments achieved advanced results on standard Chinese NER datasets from four domains. Moreover, the accuracy of entity-type prediction improved, indicating that the proposed local information-aware approach is interpretable.