CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

From arXiv: 6 figures, 14 tables; the appendix includes bootstrap CIs, metric definitions, duplicate-position sensitivity, the prompt template, and reproducibility details.

As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
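The three perturbation operators and the two-support interaction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`remove`, `replace`, `duplicate`, `interaction`), the choice to place a duplicate directly after its original, and the caller-supplied `score` function are all hypothetical. The abstract does not specify how distractors are chosen or where duplicates are inserted.

```python
from typing import Callable, List, Sequence

Evidence = List[str]
Scorer = Callable[[Evidence], float]  # hypothetical: maps an evidence list to a utility score


def remove(items: Sequence[str], i: int) -> Evidence:
    """REMOVE: drop the evidence item at index i."""
    return [e for j, e in enumerate(items) if j != i]


def replace(items: Sequence[str], i: int, distractor: str) -> Evidence:
    """REPLACE: swap the item at index i for a distractor passage."""
    out = list(items)
    out[i] = distractor
    return out


def duplicate(items: Sequence[str], i: int) -> Evidence:
    """DUPLICATE: repeat the item at index i.

    Assumption: the copy is placed directly after the original.
    """
    out = list(items)
    out.insert(i + 1, items[i])
    return out


def interaction(score: Scorer, items: Sequence[str], i: int, j: int) -> float:
    """Non-additivity of removing two items.

    Returns (drop from removing both) - (sum of the two single-removal drops).
    A positive value means the joint removal hurts more than the single
    removals would predict if their effects simply added up.
    """
    base = score(list(items))
    drop_i = base - score(remove(items, i))
    drop_j = base - score(remove(items, j))
    both = [e for k, e in enumerate(items) if k not in (i, j)]
    drop_both = base - score(both)
    return drop_both - (drop_i + drop_j)
```

In practice `score` would wrap a full RAG call plus one of the utility axes (for example, answer correctness); here it is left abstract so that the perturbation logic stays independent of any particular model or metric.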
