Named Entity Recognition (NER) is a core task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation: traditional methods often missegment or entirely miss discontinuous entities that span sentence boundaries, which significantly reduces recognition accuracy. We therefore aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction owing to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to improve their ability to recognize discontinuous entities and to handle segmentation challenges. Experiments confirm that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, which degrades performance, whereas our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
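To make the idea of treating a tagging grid like an image more concrete, the following is a minimal sketch and not the implementation evaluated here. It assumes a word-pair grid in which a cell marks either a "next word of the same entity" link or a "tail-to-head" link; the label names, the helper functions (`build_grid`, `crop`, `pad`, `count_entities`), and the example sentence are all hypothetical and chosen only for illustration. Scaling is omitted. The sketch shows how a crop that cuts through a discontinuous entity drops it, while padding leaves every entity intact.

```python
# Hypothetical cell labels for an assumed word-pair grid scheme.
NONE, NEXT, TAIL_HEAD = 0, 1, 2


def build_grid(n_tokens, entities):
    """Build an n x n grid from entities given as ordered token-index lists."""
    grid = [[NONE] * n_tokens for _ in range(n_tokens)]
    for idxs in entities:
        # Link each entity token to the next token of the same entity.
        for a, b in zip(idxs, idxs[1:]):
            grid[a][b] = NEXT
        # Close the entity with a link from its last token to its first.
        grid[idxs[-1]][idxs[0]] = TAIL_HEAD
    return grid


def crop(grid, start, end):
    """Keep only the sub-grid for tokens in [start, end), like an image crop."""
    return [row[start:end] for row in grid[start:end]]


def pad(grid, size):
    """Pad the grid with NONE cells on the right and bottom up to size x size."""
    n = len(grid)
    if size < n:
        raise ValueError("size must be at least the current grid size")
    padded = [row + [NONE] * (size - n) for row in grid]
    padded += [[NONE] * size for _ in range(size - n)]
    return padded


def count_entities(grid):
    """Count tail-to-head links, i.e. entities whose closing link is present."""
    return sum(cell == TAIL_HEAD for row in grid for cell in row)


# Illustrative sentence: "pain in arms and shoulders"
tokens = ["pain", "in", "arms", "and", "shoulders"]
entities = [
    [0, 1, 2],  # "pain in arms" (contiguous)
    [0, 1, 4],  # "pain in shoulders" (discontinuous, skips "arms and")
]

grid = build_grid(len(tokens), entities)
cropped = crop(grid, 0, 3)   # window ends before "shoulders"
padded = pad(grid, 6)        # one extra empty row and column
```

Under these assumptions, the cropped window still contains the closing link of the contiguous entity but loses that of the discontinuous one, which mirrors the segmentation failure described above; the padded grid keeps both.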