In this paper, we propose a novel way of improving named entity recognition (NER) in Korean using language-specific features. While NER has been studied extensively in recent years, efficient recognition of named entities in Korean has hardly been explored, because Korean has distinct linguistic properties that prevent models from achieving their best performance. We therefore propose an annotation scheme for Korean corpora that adopts the CoNLL-U format. The scheme decomposes Korean words into morphemes and reduces the ambiguity of named entities in the original segmentation, where a word may contain functional morphemes such as postpositions and particles. We investigate how named entity tags are best represented in this morpheme-based scheme and implement an algorithm to convert word-based and syllable-based Korean corpora with named entities into the proposed morpheme-based format. Analyses of the results of statistical and neural models show that the proposed morpheme-based format is feasible, and demonstrate how model performance varies under the influence of additional language-specific features. We also consider extrinsic conditions to observe how the performance of the proposed models varies across different types of data, including the original segmentation and different tagging formats.
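To make the word-to-morpheme conversion concrete, the following is a minimal sketch and not the algorithm implemented in the paper. It assumes BIO-style word-level tags, a morphological analysis that is already available as (form, POS) pairs, and Sejong-style POS tags in which labels beginning with "J" (postpositions/particles) or "E" (endings) mark functional morphemes. The function name `word_to_morpheme_tags` and the constant `FUNCTIONAL_PREFIXES` are hypothetical, and only trailing functional morphemes are excluded from the entity span.

```python
# Assumption: Sejong-style POS tags, where "J*" are postpositions/particles
# and "E*" are verbal endings. Adjust for a different tag set.
FUNCTIONAL_PREFIXES = ("J", "E")


def is_functional(xpos):
    """Return True if the POS tag marks a functional morpheme."""
    return xpos.startswith(FUNCTIONAL_PREFIXES)


def word_to_morpheme_tags(morphemes, word_tag):
    """Project one word-level BIO tag onto the word's morphemes.

    morphemes: list of (form, xpos) pairs for a single word.
    word_tag:  word-level tag such as "B-LOC", "I-LOC", or "O".
    Trailing functional morphemes are tagged "O" so that the entity
    span covers only the content morphemes.
    """
    if word_tag == "O":
        return ["O"] * len(morphemes)

    prefix, label = word_tag.split("-", 1)

    # Index just past the last non-functional morpheme.
    end = len(morphemes)
    while end > 0 and is_functional(morphemes[end - 1][1]):
        end -= 1

    tags = []
    for i in range(len(morphemes)):
        if i >= end:
            tags.append("O")            # trailing particle or ending
        elif i == 0 and prefix == "B":
            tags.append("B-" + label)   # entity starts in this word
        else:
            tags.append("I-" + label)   # entity continues
    return tags


if __name__ == "__main__":
    # Illustrative input: the word 프랑스의 ("France's") tagged B-LOC.
    word = [("프랑스", "NNP"), ("의", "JKG")]
    for (form, xpos), tag in zip(word, word_to_morpheme_tags(word, "B-LOC")):
        print(form, xpos, tag, sep="\t")
```

In this sketch the proper noun 프랑스 keeps the B-LOC tag while the genitive particle 의 is tagged O, which illustrates how a morpheme-based segmentation can separate the entity from the functional morphemes attached to it in the original word.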