For knowledge-intensive NLP tasks, it is widely accepted that access to more information contributes to better end-to-end model performance. Counter-intuitively, however, too much context can hurt the model when it is evaluated on common question answering (QA) datasets. In this paper, we analyze how passages can have a detrimental effect on the retrieve-then-read architectures used in question answering. Our empirical evidence indicates that the current read architecture does not fully leverage the retrieved passages and performs significantly worse when given all retrieved passages than when given a subset of them. Our findings demonstrate that model accuracy can be improved by 10% on two popular QA datasets by filtering out detrimental passages. These gains are obtained with existing retrieval methods, without further training or data. We further highlight the challenges of identifying the detrimental passages. First, even with the correct context, the model can make an incorrect prediction, which makes it hard to determine which passages are most influential. Second, evaluation typically relies on lexical matching, which is not robust to variations of correct answers. Despite these limitations, our experimental results underscore the pivotal role of identifying and removing detrimental passages in a context-efficient retrieve-then-read pipeline. Code and data are available at https://github.com/xfactlab/emnlp2023-damaging-retrieval
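To make the two ideas above concrete (lexical-match evaluation and the search for detrimental passages), here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure from the paper or its repository: the normalization follows the common SQuAD-style convention, the leave-one-out search is one simple way to probe passage subsets, and the names `normalize_answer`, `exact_match`, `damaging_passages`, and `toy_reader` are hypothetical.

```python
import re
import string
from typing import Callable, List, Sequence

# Assumed interface: a reader maps (question, passages) to an answer string.
Reader = Callable[[str, List[str]], str]


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """Lexical matching: True only if the normalized strings are identical."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def damaging_passages(
    question: str,
    passages: List[str],
    gold_answers: Sequence[str],
    reader: Reader,
) -> List[int]:
    """Leave-one-out probe (an assumption, not the paper's method).

    Returns the indices of passages whose removal turns an incorrect
    prediction on the full passage set into a correct one.
    """
    if exact_match(reader(question, passages), gold_answers):
        return []  # already correct with all passages; nothing to filter
    damaging = []
    for i in range(len(passages)):
        subset = passages[:i] + passages[i + 1:]
        if exact_match(reader(question, subset), gold_answers):
            damaging.append(i)
    return damaging


def toy_reader(question: str, passages: List[str]) -> str:
    """Hypothetical stand-in for a real reader: distracted by any 'Lyon' passage."""
    if any("Lyon" in p for p in passages):
        return "Lyon"
    return "Paris"


passages = [
    "Paris is the capital of France.",
    "Lyon is a large city in France.",
    "France is in Europe.",
]
found = damaging_passages(
    "What is the capital of France?", passages, ["Paris"], toy_reader
)
```

With the toy reader, only the passage about Lyon is flagged. The same `exact_match` function also shows the second challenge: a prediction of "NYC" against the gold answer "New York City" counts as wrong, even though a human would accept it.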