Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators often improves accuracy, it also increases inference and deployment overhead. We study an orthogonal axis: enlarging the retriever's corpus, and how it trades off with generator scale. Across multiple open-domain QA benchmarks, corpus scaling consistently strengthens RAG and can in many cases match the gains of moving to a larger model tier, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and very large models benefit less. Our analysis suggests that these improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. Overall, our results characterize a corpus-generator trade-off in RAG and provide empirical guidance on how corpus scale and model capacity interact in this setting.