Question answering (QA) models have shown compelling results on the task of Machine Reading Comprehension (MRC). Recently, these systems have been shown to outperform humans on the held-out test sets of datasets such as SQuAD, but their robustness is not guaranteed. The brittleness of QA models is exposed by a drop in performance when they are evaluated on adversarially generated examples. In this study, we explore the robustness of MRC models to entity renaming, using entities from low-resource regions such as Africa. We propose EntSwap, a method for test-time perturbation that creates a test set whose entities have been renamed. In particular, we rename entities of the types country, person, nationality, location, organization, and city to create AfriSQuAD2. Using the perturbed test set, we evaluate the robustness of three popular MRC models. We find that, compared to base models, large models perform comparatively well on novel entities. Furthermore, our analysis indicates that the person entity type is especially challenging for MRC models.
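The abstract does not describe how EntSwap works internally, so the following is only a minimal sketch of test-time entity renaming on a single SQuAD-style example, not the authors' implementation. It assumes the entity mentions and their replacements are already known (in practice they might come from an NER tagger and a curated name list). The function names `build_pattern`, `swap_text`, and `swap_example`, the mapping, and the replacement names are all hypothetical.

```python
import re


def build_pattern(mapping):
    """Compile one regex matching any entity name in the (hypothetical) mapping."""
    if not mapping:
        raise ValueError("mapping must contain at least one entity")
    # Longest names first, so multi-word names win over their substrings.
    names = sorted(mapping, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")


def swap_text(text, mapping, pattern):
    """Replace every matched entity mention in text with its new name."""
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def swap_example(example, mapping):
    """Rename entities consistently in context, question, and answers.

    Assumption: an answer span never starts in the middle of a renamed
    entity; such cases would need extra handling that this sketch omits.
    """
    pattern = build_pattern(mapping)
    context = example["context"]
    # (start, end, replacement) for every entity mention in the original context.
    matches = [(m.start(), m.end(), mapping[m.group(0)])
               for m in pattern.finditer(context)]

    new_answers = {"text": [], "answer_start": []}
    answers = example["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        # Shift the start by the length change of every mention that ends
        # at or before the answer begins.
        shift = sum(len(new) - (end - beg)
                    for beg, end, new in matches if end <= start)
        new_answers["text"].append(swap_text(text, mapping, pattern))
        new_answers["answer_start"].append(start + shift)

    return {
        "context": swap_text(context, mapping, pattern),
        "question": swap_text(example["question"], mapping, pattern),
        "answers": new_answers,
    }


# Illustrative data only; not taken from SQuAD or AfriSQuAD2.
context = "Marie Curie was born in Warsaw. She later moved to Paris."
example = {
    "context": context,
    "question": "Where was Marie Curie born?",
    "answers": {"text": ["Warsaw"], "answer_start": [context.index("Warsaw")]},
}
mapping = {"Marie Curie": "Amina Okafor", "Warsaw": "Nairobi", "Paris": "Lagos"}
perturbed = swap_example(example, mapping)
```

Renaming the question and answers with the same mapping as the context keeps the example answerable, so any drop in a model's score on the perturbed set can be attributed to the unfamiliar entity names rather than to a broken example.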