Retrieval-Augmented Generation (RAG) systems rely on retrieved documents being concatenated into a model's input context, making both document ordering and context size critical yet controversial design choices. Prior work reports position-based effects such as "lost in the middle" and related long-context phenomena. However, empirical findings remain inconsistent and hard to reproduce across models, datasets, and evaluation protocols. In this paper, we present a systematic reproducibility study that revisits these claims and examines how they evolve with contemporary LLMs under a controlled evaluation framework. We first show that topic sampling is a major source of variance: small topic sets can mask or exaggerate ordering effects. Based on repeated subset sampling across multiple topic budgets, we provide a practical calibration procedure that identifies topic counts yielding stable trends at feasible cost. Using these fixed topic sets, we then reproduce and extend results on position sensitivity, re-evaluating "lost in the middle" and positional biases in modern LLMs. We then study a more realistic RAG scenario in which relevance is mediated by a retriever rather than by oracle access to ground-truth documents. In this setting, we re-examine a recent industry study and attribute discrepancies to evaluation choices such as limited topic coverage and reliance on LLM-based judges. Finally, we analyse how retrieval order and context size affect downstream LLM performance under imperfect retrieval. Our results demonstrate that both factors interact strongly with retrieval quality and model choice, and that conclusions drawn from idealised setups do not always transfer to real-world RAG pipelines. We release all code and configurations to support reproducibility and future work on robust RAG evaluation.
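The abstract does not spell out the calibration procedure, so the following is only a minimal sketch of what repeated subset sampling across topic budgets could look like. It rests on assumptions: the function names, the per-topic score lists, and the stability criterion (how often a sampled subset reproduces the sign of the full-set gap between two conditions) are all hypothetical and are not taken from the released code.

```python
import random
import statistics


def subset_stability(scores_a, scores_b, budgets, n_repeats=200, seed=0):
    """Estimate how stable the A-vs-B gap is at each topic budget.

    scores_a / scores_b are hypothetical per-topic scores for two conditions
    (for example, answer accuracy under two document orderings), aligned by
    topic. Returns {budget: (sign_agreement, std_of_gap)}.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be aligned per topic")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    full_gap = statistics.fmean(diffs)  # gap measured on the full topic set

    results = {}
    for k in budgets:
        if not 0 < k <= len(diffs):
            raise ValueError(f"budget {k} is outside 1..{len(diffs)}")
        # Draw n_repeats random topic subsets of size k (without replacement
        # within a subset) and record the mean gap on each.
        gaps = [statistics.fmean(rng.sample(diffs, k)) for _ in range(n_repeats)]
        # Assumed criterion: fraction of subsets whose gap has the same sign
        # as the full-set gap.
        agreement = sum((g > 0) == (full_gap > 0) for g in gaps) / n_repeats
        results[k] = (agreement, statistics.pstdev(gaps))
    return results


def smallest_stable_budget(results, min_agreement=0.95):
    """Return the smallest budget meeting the agreement threshold, else None."""
    for k in sorted(results):
        if results[k][0] >= min_agreement:
            return k
    return None
```

Under these assumptions, one would call `subset_stability` with the per-topic scores of two orderings and a list of candidate budgets, then pass the result to `smallest_stable_budget` to pick the cheapest topic count at which the observed trend stops flipping. The 0.95 threshold and the 200 repeats are illustrative defaults, not values reported by the study.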