Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are factual, efficient, and up-to-date. An important desideratum of RALMs is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases in which retrieval reduces accuracy. We then propose two methods to mitigate this issue. The first is a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective at preventing performance reduction, but comes at the cost of also discarding relevant passages. We therefore propose a method for automatically generating data to fine-tune the language model to properly leverage retrieved passages, using a mix of relevant and irrelevant contexts at training time. We show empirically that as few as 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.
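The NLI filtering baseline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and parameter names (`filter_passages`, `entails`) are hypothetical, and the lexical-overlap `toy_entails` is a toy stand-in for a real NLI model (e.g., an MNLI-fine-tuned classifier) that scores whether a passage entails the declarativized question-answer pair.

```python
import string

def _tokens(text):
    """Lowercase, strip punctuation, and split into a token set."""
    table = str.maketrans("", "", string.punctuation)
    return set(text.lower().translate(table).split())

def toy_entails(premise, hypothesis):
    """Toy stand-in for an NLI model: crude lexical-overlap heuristic.
    A real system would run a trained entailment classifier here."""
    hyp = _tokens(hypothesis)
    return len(hyp & _tokens(premise)) / max(len(hyp), 1) > 0.5

def filter_passages(question, answer, passages, entails):
    """Keep only retrieved passages judged to entail the QA pair."""
    hypothesis = f"{question} {answer}"  # naive declarativization of the QA pair
    return [p for p in passages if entails(p, hypothesis)]

kept = filter_passages(
    "Who wrote Hamlet?", "William Shakespeare",
    ["William Shakespeare wrote Hamlet around 1600.",
     "The Eiffel Tower is in Paris."],
    toy_entails)
# kept retains only the entailing passage; the irrelevant one is discarded
```

As the abstract notes, such a filter is conservative: any passage the entailment check rejects is dropped, so relevant passages that the NLI model misjudges are lost along with the irrelevant ones.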