Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of a system's outputs; however, existing methods typically target simple queries with no retrieval or single-step retrieval and do not properly handle the RAR setting. Accurate UQ for RAR requires accounting for all sources of uncertainty, including those arising from both retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C), a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to individual reasoning steps. These perturbations alter the retriever's input, which shifts its output and in turn modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, allowing us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness on two downstream tasks: in Abstention, it yields ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves exact match by ~7% over individual models and ~3% over existing selection methods.
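To make the perturbation-and-consistency idea concrete, the following is a minimal sketch of how a consistency score over perturbed reasoning traces could be computed. The function names (run_rar, perturb_step, answers_agree), their signatures, and the perturbation scheme are illustrative assumptions, not the paper's actual implementation of R2C.

```python
"""Hypothetical sketch of a consistency-style UQ score in the spirit of R2C.
All names and signatures here are assumptions for illustration only."""
import random
from typing import Callable, List, Tuple


def r2c_confidence(
    question: str,
    run_rar: Callable[[str, List[str]], Tuple[str, List[str]]],  # -> (final_answer, reasoning_trace)
    perturb_step: Callable[[str], str],                           # applies one action to a reasoning step
    answers_agree: Callable[[str, str], bool],                    # semantic-equivalence check between answers
    n_perturbations: int = 10,
) -> float:
    """Estimate confidence as the agreement rate between the original answer
    and answers produced from perturbed reasoning traces."""
    # 1) Run the unperturbed RAR pipeline once to get a reference answer and trace.
    original_answer, trace = run_rar(question, [])

    agreements = 0
    for _ in range(n_perturbations):
        # 2) Apply an action to one randomly chosen reasoning step. The altered
        #    step changes the retriever's input, which shifts the retrieved
        #    evidence and hence the generator's input at subsequent steps.
        step_idx = random.randrange(len(trace))
        perturbed_trace = list(trace)
        perturbed_trace[step_idx] = perturb_step(trace[step_idx])

        # 3) Resume the RAR pipeline from the perturbed trace and collect
        #    the new final answer.
        perturbed_answer, _ = run_rar(question, perturbed_trace)

        # 4) Count how often the perturbed run still agrees with the original.
        agreements += int(answers_agree(original_answer, perturbed_answer))

    # High agreement under perturbation -> high confidence (low uncertainty).
    return agreements / n_perturbations
```

In this sketch, the confidence score could be thresholded for Abstention or used to rank candidate systems for Model Selection, mirroring the two downstream uses described above.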