Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness under realistic production constraints remains underexplored. In this work, we evaluate retrieval fusion in a production-style RAG pipeline operating over an enterprise knowledge base, with fixed retrieval depth, re-ranking budgets, and latency constraints. Across multiple fusion configurations, we find that retrieval fusion does increase raw recall, but these gains are largely neutralized after re-ranking and truncation. In our setting, fusion variants fail to outperform single-query baselines on KB-level Top-$k$ accuracy, with Hit@10 decreasing from $0.51$ to $0.48$ in several configurations. Moreover, fusion introduces additional latency overhead due to query rewriting and larger candidate sets, without corresponding improvements in downstream effectiveness. Our analysis suggests that recall-oriented fusion techniques exhibit diminishing returns once realistic re-ranking limits and context budgets are applied. We conclude that retrieval-level improvements do not reliably translate into end-to-end gains in production RAG systems, and argue for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact.
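For readers unfamiliar with the two quantities named above, the following is a minimal sketch of reciprocal rank fusion and Hit@$k$, not the pipeline evaluated in this work. It assumes the commonly used RRF form, in which each document scores the sum of $1/(c + \text{rank})$ over the lists it appears in, with the smoothing constant $c = 60$ chosen here as a conventional default. The function names, the document IDs, and the two example rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives the sum of 1 / (c + rank) over every list
    that contains it, with ranks starting at 1. The constant c=60 is an
    assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_at_k(ranked, relevant, k):
    """Return 1.0 if any relevant document is in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


# Hypothetical rankings returned for two rewrites of the same question.
rankings = [["d1", "d2", "d3"], ["d3", "d1", "d4"]]
fused = reciprocal_rank_fusion(rankings)

# Fusion surfaces d4, which the first query alone never retrieved,
# but a truncation to the top 3 drops it again.
print(fused)
print(hit_at_k(fused, {"d4"}, k=3), hit_at_k(fused, {"d4"}, k=4))
```

In this toy case the fused list contains a document that the first query alone missed, so raw recall rises, yet the document falls outside a top-3 cutoff. That is one simple way a recall gain can vanish once a fixed truncation is applied.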