Named entity recognition (NER) systems have progressed rapidly in recent years thanks to the development of deep neural networks. They are widely used in natural language processing applications such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, an NER system may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea behind automated testing is that the same named entity should receive identical NER predictions in similar contexts. The core idea behind automated repair is that similar named entities should receive the same NER prediction in the same context. We use TIN to test two state-of-the-art (SOTA) NER models and two commercial NER APIs, namely Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are genuine errors, yielding high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repair, TIN achieves a high error reduction rate (26.8%-50.6%) on the four systems under test, successfully repairing 1,056 of the 1,877 reported NER errors.
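The two key ideas above can be illustrated with a minimal sketch. This is not TIN's implementation: it assumes a hypothetical NER interface that maps a sentence to a `{entity_text: label}` dictionary, assumes the similar contexts and similar entities are already given (how TIN generates them is not shown), and uses a hypothetical `toy_ner` function with a deliberate bug that mirrors the "female names labeled as chemicals" example. The testing check flags an entity whose label changes across similar contexts; the repair step takes a majority vote over similar entities placed in the same context.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical NER interface (assumption): sentence -> {entity_text: label}.
NerFn = Callable[[str], Dict[str, str]]


def toy_ner(sentence: str) -> Dict[str, str]:
    """Hypothetical stand-in for a real NER system. It labels a few known
    names as PERSON but, as a deliberate bug, labels "Ada" as CHEMICAL
    whenever it follows the word "with"."""
    known = {"Alice": "PERSON", "Ada": "PERSON", "Maria": "PERSON"}
    out: Dict[str, str] = {}
    for name, label in known.items():
        if name in sentence:
            buggy = name == "Ada" and f"with {name}" in sentence
            out[name] = "CHEMICAL" if buggy else label
    return out


def check_same_entity(ner: NerFn, entity: str, contexts: List[str]) -> bool:
    """Testing idea (sketch): the same entity in similar contexts should get
    identical labels. Each context holds a '{}' placeholder for the entity.
    Returns True if all predicted labels agree, False if an issue is found."""
    labels = {ner(ctx.format(entity)).get(entity) for ctx in contexts}
    return len(labels) == 1


def repair_by_vote(
    ner: NerFn, entity: str, context: str, similar: List[str]
) -> Optional[str]:
    """Repair idea (sketch): similar entities in the same context should
    share a label, so return the majority label over the original entity
    and its similar substitutes (ties go to the first label seen)."""
    votes: Counter = Counter()
    for e in [entity] + similar:
        votes[ner(context.format(e)).get(e)] += 1
    label, _ = votes.most_common(1)[0]
    return label


if __name__ == "__main__":
    contexts = ["I met {} yesterday.", "I worked with {} yesterday."]
    print(check_same_entity(toy_ner, "Ada", contexts))
    print(repair_by_vote(toy_ner, "Ada", contexts[1], ["Alice", "Maria"]))
```

Here the check reports an inconsistency for "Ada" (PERSON in one context, CHEMICAL in the other), and the vote over "Ada", "Alice", and "Maria" in the faulty context restores PERSON. A real system would also need to handle entity spans, multi-token entities, and the four error categories, which this sketch ignores.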