Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of a system's outputs. Existing UQ methods, however, typically target simple queries with no retrieval or single-step retrieval and do not properly handle the RAR setting. Accurate uncertainty estimation for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and from generation. In this paper, we account for these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C), a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness on two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves exact match by ~7% over single models and ~3% over selection methods.
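The perturb-and-compare idea described above can be illustrated with a minimal sketch. The abstract does not specify which perturbation actions R2C uses or how agreement is scored, so everything below is an assumption made for illustration: the toy retriever and generator, the two actions (`paraphrase`, `distract`), and the scoring rule (fraction of perturbed runs whose final answer matches the unperturbed one) are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence

Retriever = Callable[[str], str]
Generator = Callable[[str, List[str]], str]
Action = Callable[[str], str]  # rewrites a reasoning step (hypothetical)


def run_rar(question: str, retrieve: Retriever, generate: Generator,
            n_steps: int, action: Optional[Action] = None,
            perturb_at: int = -1) -> str:
    """Run a toy multi-step retrieve-then-generate loop; return the final answer."""
    docs: List[str] = []
    step = question
    for i in range(n_steps):
        if action is not None and i == perturb_at:
            step = action(step)          # perturb the current reasoning step
        docs.append(retrieve(step))      # a changed step changes what is retrieved
        step = generate(question, docs)  # which changes the next generation
    return step


def consistency_score(question: str, retrieve: Retriever, generate: Generator,
                      actions: Sequence[Action], n_steps: int = 2) -> float:
    """Fraction of perturbed runs whose answer matches the unperturbed answer."""
    if not actions:
        raise ValueError("at least one perturbation action is required")
    reference = run_rar(question, retrieve, generate, n_steps)
    answers = [
        run_rar(question, retrieve, generate, n_steps, action, perturb_at)
        for action in actions
        for perturb_at in range(n_steps)
    ]
    return sum(a == reference for a in answers) / len(answers)


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def toy_retrieve(query: str) -> str:
    if "france" in query.lower():
        return "Paris is the capital of France."
    return "no relevant document"


def toy_generate(question: str, docs: List[str]) -> str:
    return "Paris" if any("Paris" in d for d in docs) else "unknown"


def paraphrase(step: str) -> str:
    return "Please answer: " + step   # keeps the step's content


def distract(step: str) -> str:
    return "weather tomorrow"         # replaces the step with an off-topic query


score = consistency_score("What is the capital of France?",
                          toy_retrieve, toy_generate,
                          actions=[paraphrase, distract])
print(score)  # 0.75: only the off-topic perturbation at step 0 flips the answer
```

In this toy setting a score near 1 means the answer is stable under perturbation, while a lower score means it is sensitive to changes in what gets retrieved or generated along the way. Such a score could then serve as the confidence signal for abstention or model selection.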