Although Retrieval-Augmented Large Language Models (RALMs) demonstrate superior factuality, they do not consistently outperform the original retrieval-free Language Models (LMs). Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LMs but also among RALMs with different retrievers. To understand this phenomenon, we investigate the degeneration behavior of RALMs and theoretically decompose it into four categories. Further analysis based on this decomposition reveals that the innate differences among knowledge sources and the unpredictable degeneration of the reader model contribute most to the inconsistency. Drawing on our analysis, we introduce Ensemble of Retrievers (EoR), a trainable framework that adaptively retrieves from different knowledge sources and effectively reduces unpredictable reader errors. Our experiments on Open-Domain Question Answering show that EoR substantially outperforms the RALM with a single retriever by considerably reducing inconsistent behaviors.