Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are factual, efficient, and up-to-date. An important desideratum of RALMs is that retrieved information helps model performance when it is relevant and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis of five open-domain question answering benchmarks, characterizing cases in which retrieval reduces accuracy. We then propose two methods to mitigate this issue. The first is a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective in preventing performance reduction, but at the cost of also discarding relevant passages. We therefore propose a second method that automatically generates data to fine-tune the language model to properly leverage retrieved passages, using a mix of relevant and irrelevant contexts at training time. We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.
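The NLI-based filtering baseline can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_entailment_score` is a token-overlap stand-in for a real NLI model, the hypothesis is formed by simply concatenating the question with a candidate answer, and the 0.5 threshold is arbitrary. None of these details are specified in the abstract.

```python
from typing import Callable, List

# Hypothetical interface: score in [0, 1] for "premise entails hypothesis".
EntailmentScorer = Callable[[str, str], float]


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: fraction of hypothesis tokens found in the premise.

    This is NOT real entailment; it only keeps the sketch runnable without a model.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(1 for token in hypothesis_tokens if token in premise_tokens)
    return hits / len(hypothesis_tokens)


def filter_passages(
    question: str,
    answer: str,
    passages: List[str],
    scorer: EntailmentScorer = toy_entailment_score,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only passages whose score for the question-answer pair meets the threshold."""
    hypothesis = f"{question} {answer}"  # assumed way of combining question and answer
    return [p for p in passages if scorer(p, hypothesis) >= threshold]


passages = [
    "paris is the capital of france",
    "bananas are rich in potassium",
]
kept = filter_passages("capital of france", "paris", passages)
```

With the toy scorer, only the first passage survives. The sketch also makes the stated trade-off visible: any passage scoring below the threshold is discarded, so a strict filter can drop passages that were in fact relevant.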
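The idea of fine-tuning on a mix of relevant and irrelevant contexts can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_mixed_examples`, the field names, the prompt template, and the 50/50 default mixing ratio are all hypothetical, and the abstract does not describe how the data is actually generated.

```python
import random
from typing import Dict, List


def build_mixed_examples(
    qa_pairs: List[Dict[str, str]],
    irrelevant_pool: List[str],
    irrelevant_fraction: float = 0.5,  # assumed ratio, not taken from the paper
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Pair each question with either its relevant passage or a random irrelevant one.

    The target answer is the same in both cases, so a model trained on these
    examples is pushed to answer correctly whether or not the context helps.
    """
    rng = random.Random(seed)
    examples = []
    for item in qa_pairs:
        use_irrelevant = rng.random() < irrelevant_fraction
        context = rng.choice(irrelevant_pool) if use_irrelevant else item["relevant_passage"]
        examples.append(
            {
                # Hypothetical prompt template.
                "prompt": f"Context: {context}\nQuestion: {item['question']}\nAnswer:",
                "target": item["answer"],
                "context_is_relevant": not use_irrelevant,
            }
        )
    return examples


qa_pairs = [
    {"question": "q1", "answer": "a1", "relevant_passage": "r1"},
    {"question": "q2", "answer": "a2", "relevant_passage": "r2"},
]
mixed = build_mixed_examples(qa_pairs, irrelevant_pool=["unrelated text"])
```

Setting `irrelevant_fraction` to 0.0 or 1.0 yields all-relevant or all-irrelevant training sets, which is a convenient way to compare against the mixed setting.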