As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. However, current robustness evaluation research primarily develops synthetic perturbation methods, leaving it unclear how well these reflect real-world scenarios. To address this, we present a framework to automatically examine MRC models on naturally occurring textual perturbations, replacing paragraphs in MRC benchmarks with counterparts drawn from the Wikipedia edit history. This type of perturbation is natural in that it does not stem from an artificial generative process, making it inherently distinct from previously investigated synthetic approaches. In a large-scale study covering SQuAD datasets and a variety of model architectures, we observe that natural perturbations degrade the performance of pre-trained encoder language models. More worryingly, state-of-the-art Flan-T5 and large language models (LLMs) inherit these errors. Further experiments demonstrate that our findings generalise to natural perturbations found in other, more challenging MRC benchmarks. To mitigate these errors, we show that robustness to natural perturbations can be improved by training on naturally or synthetically perturbed examples, though a noticeable gap remains compared to performance on unperturbed data.
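To make the perturbation mechanism concrete, the following is a minimal sketch (not the authors' implementation) of how one might source such natural perturbations: it fetches the two most recent revisions of a Wikipedia article through the public MediaWiki revisions API and pairs up paragraphs whose wording differs. The positional alignment heuristic, the function names, and the example article title are illustrative assumptions; a real pipeline would align paragraphs by similarity and filter out markup-only or vandalism edits.

```python
# Hypothetical sketch: mine candidate natural perturbations from the
# Wikipedia edit history via the public MediaWiki API.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title: str, limit: int = 2) -> list[str]:
    """Return the wikitext of the most recent `limit` revisions of `title`."""
    params = {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "content", "rvslots": "main", "rvlimit": limit,
        "format": "json", "formatversion": 2,
    }
    data = requests.get(API, params=params, timeout=30).json()
    revisions = data["query"]["pages"][0]["revisions"]  # newest first
    return [r["slots"]["main"]["content"] for r in revisions]

def perturbed_pairs(title: str) -> list[tuple[str, str]]:
    """Pair paragraphs that changed between the two latest revisions."""
    new, old = fetch_revisions(title, limit=2)
    new_paras = [p.strip() for p in new.split("\n\n") if p.strip()]
    old_paras = [p.strip() for p in old.split("\n\n") if p.strip()]
    # Naive positional alignment; kept deliberately simple for the sketch.
    return [(o, n) for o, n in zip(old_paras, new_paras) if o != n]

if __name__ == "__main__":
    # Example article chosen arbitrarily for illustration.
    for previous, current in perturbed_pairs("Alan Turing")[:3]:
        print(previous[:120], "->", current[:120])
```

Each resulting pair can then stand in for an original MRC paragraph and its naturally perturbed counterpart, against which a model's answers are compared.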