Retrieval-augmented generation (RAG) greatly benefits language models (LMs) by providing additional context for tasks such as document-based question answering (DBQA). Despite its potential, the effectiveness of RAG depends heavily on its configuration, raising the question: what is the optimal RAG configuration? To answer this, we introduce the RAGGED framework to analyze and optimize RAG systems. On a set of representative DBQA tasks, we study two classic retrievers, one sparse and one dense, and four top-performing LMs spanning encoder-decoder and decoder-only architectures. Through RAGGED, we uncover that different models suit substantially different RAG setups. While encoder-decoder models improve monotonically with more documents, decoder-only models can effectively use only a few (< 5) documents, despite often having a longer context window. RAGGED offers further insights into LMs' context utilization habits: encoder-decoder models rely more on contexts and are thus more sensitive to retrieval quality, while decoder-only models tend to rely on knowledge memorized during training.