Named entity recognition (NER) models often struggle with noisy inputs, such as text containing spelling mistakes or errors introduced by optical character recognition (OCR), and learning a robust NER model is challenging. Existing robust NER models use both noisy text and its corresponding gold text for training, which is infeasible in many real-world applications where gold text is unavailable. In this paper, we consider a more realistic setting in which only noisy text and its NER labels are available. We propose to retrieve text relevant to the noisy input from a knowledge corpus and use it to enhance the representation of the original noisy input. We design three retrieval methods: sparse retrieval based on lexical similarity, dense retrieval based on semantic similarity, and self-retrieval based on task-specific text. After retrieval, we concatenate the retrieved text with the original noisy text and encode them with a transformer network, using self-attention to enhance the contextual token representations of the noisy text with the retrieved text. We further employ a multi-view training framework that improves robust NER without requiring retrieval during inference. Experiments show that our retrieval-augmented model achieves significant improvements in various noisy NER settings.
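The sparse-retrieval and concatenation steps described above can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring here is a simple IDF-weighted lexical overlap standing in for a full lexicon-similarity retriever (e.g. BM25), and the function names (`sparse_retrieve`, `augment`) and the `[SEP]` separator are illustrative assumptions.

```python
# Minimal sketch of sparse retrieval + input augmentation (illustrative,
# not the paper's implementation).
from collections import Counter
import math


def sparse_retrieve(query, corpus, k=1):
    """Rank corpus sentences by IDF-weighted lexical overlap with the query.

    A stand-in for a lexicon-similarity retriever such as BM25: tokens
    shared with the query score higher when they are rare in the corpus.
    """
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    df = Counter()                      # document frequency per token
    for toks in docs:
        df.update(set(toks))
    q_toks = set(query.lower().split())
    scored = []
    for i, toks in enumerate(docs):
        overlap = q_toks & set(toks)
        score = sum(math.log(1 + n / df[t]) for t in overlap)
        scored.append((score, i))
    scored.sort(reverse=True)
    return [corpus[i] for _, i in scored[:k]]


def augment(noisy_text, corpus, k=1, sep="[SEP]"):
    """Concatenate the top-k retrieved sentences after the noisy input,
    so a transformer's self-attention can attend across both segments."""
    retrieved = sparse_retrieve(noisy_text, corpus, k=k)
    return f" {sep} ".join([noisy_text] + retrieved)
```

Because retrieval keys on lexical overlap, even a noisy query ("arived", "Barak") can still match a clean corpus sentence through its correctly spelled tokens; the concatenated sequence is then encoded jointly so the clean context informs the noisy tokens' representations.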