Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG

Retrieval-Augmented Generation (RAG) systems concatenate retrieved documents into a model's input context, making both document ordering and context size critical yet contested design choices. Prior work reports position-based effects such as "lost in the middle" and related long-context phenomena. However, empirical findings remain inconsistent and hard to reproduce across models, datasets, and evaluation protocols. In this paper, we present a systematic reproducibility study that revisits these claims and examines how they evolve with contemporary LLMs under a controlled evaluation framework. We first show that topic sampling is a major source of variance: small topic sets can mask or exaggerate ordering effects. Based on repeated subset sampling across multiple topic budgets, we provide a practical calibration procedure that identifies topic counts yielding stable trends at feasible cost. Using these fixed topic sets, we reproduce and extend results on position sensitivity, re-evaluating "lost in the middle" and positional biases in modern LLMs. Next, we study a more realistic RAG scenario in which relevance is mediated by a retriever rather than by oracle access to ground-truth documents. In this setting, we re-examine a recent industry study and identify discrepancies that we attribute to evaluation choices such as limited topic coverage and reliance on LLM-based judges. Finally, we analyse how retrieval order and context size affect downstream LLM performance under imperfect retrieval. Our results demonstrate that both factors interact strongly with retrieval quality and model choice, and that conclusions drawn from idealised setups do not always transfer to real-world RAG pipelines. We release all code and configurations to support reproducibility and future work on robust RAG evaluation.
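
The calibration idea, repeated subset sampling across several topic budgets, can be illustrated with a minimal sketch. It is not the procedure released with the paper. The function names (`subset_effects`, `calibrate`), the sign-agreement criterion, the 0.95 threshold, and the repeat count are all assumptions chosen for illustration, and the per-topic scores are assumed to be available for two document orderings A and B.

```python
import random
import statistics


def subset_effects(scores_a, scores_b, budget, n_repeats, rng):
    """Mean per-topic difference (A - B) over repeated random topic subsets.

    Hypothetical helper: scores_a[i] and scores_b[i] are the scores of
    topic i under two document orderings.
    """
    topics = list(range(len(scores_a)))
    effects = []
    for _ in range(n_repeats):
        subset = rng.sample(topics, budget)  # sample without replacement
        diffs = [scores_a[t] - scores_b[t] for t in subset]
        effects.append(statistics.fmean(diffs))
    return effects


def calibrate(scores_a, scores_b, budgets, n_repeats=200, agreement=0.95, seed=0):
    """Return the smallest budget whose subsets mostly agree in sign with
    the full-set effect, plus a per-budget report of (spread, agreement).

    The stability criterion is an assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    full_effect = statistics.fmean([a - b for a, b in zip(scores_a, scores_b)])
    report = {}
    chosen = None
    for budget in sorted(budgets):
        effects = subset_effects(scores_a, scores_b, budget, n_repeats, rng)
        same_sign = sum((e > 0) == (full_effect > 0) for e in effects) / n_repeats
        report[budget] = (statistics.pstdev(effects), same_sign)
        if chosen is None and same_sign >= agreement:
            chosen = budget
    return chosen, report


if __name__ == "__main__":
    # Synthetic scores for demonstration only; not results from the paper.
    demo_rng = random.Random(1)
    a = [demo_rng.random() for _ in range(200)]
    b = [x - 0.05 + demo_rng.gauss(0, 0.3) for x in a]
    chosen, report = calibrate(a, b, budgets=[10, 25, 50, 100, 200])
    for k, (spread, agree) in report.items():
        print(f"budget={k:4d}  spread={spread:.3f}  sign agreement={agree:.2f}")
    print("smallest stable budget:", chosen)
```

In this sketch, a small budget with low sign agreement corresponds to the case where a small topic set masks or exaggerates an ordering effect. Other stability criteria, such as confidence-interval width, would fit the same loop.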
