Prior work has uncovered a set of common problems in state-of-the-art context-based question answering (QA) systems: a lack of attention to the context when it conflicts with a model's parametric knowledge, little robustness to noise, and a lack of consistency in their answers. However, most prior work focuses on one or two of these problems in isolation, which makes it difficult to see trends across them. We aim to close this gap by first outlining a set of desiderata for QA models, some previously discussed and some novel. We then survey relevant analysis and methods papers to provide an overview of the state of the field. The second part of our work presents experiments in which we evaluate 15 QA systems on 5 datasets according to all desiderata at once. We find many novel trends, including (1) systems that are less susceptible to noise are not necessarily more consistent in their answers when given irrelevant context; (2) most systems that are more susceptible to noise are also more likely to answer correctly according to a context that conflicts with their parametric knowledge; and (3) the combination of conflicting knowledge and noise can reduce system performance by up to 96%. As such, our desiderata help increase our understanding of how these models work and reveal potential avenues for improvement.