Uncertainty Quantification for Retrieval-Augmented Reasoning

Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of a system's outputs. Existing UQ methods, however, typically target simple queries with no retrieval or single-step retrieval, and do not properly handle the RAR setup. Accurate UQ for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C), a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness on two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves exact match by ~7% over single models and ~3% over selection methods.
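The perturb-and-compare idea described above can be illustrated with a minimal sketch. It is not the R2C implementation: the abstract does not specify the perturbation actions or how agreement is scored, so the sketch assumes that an action rewrites the search query at one reasoning step and that confidence is the fraction of perturbed runs whose final answer exactly matches the unperturbed answer. All names (`run_rar`, `consistency_confidence`, the toy retriever and generator) are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases, for illustration only.
Action = Callable[[str], str]                # rewrites the query of one reasoning step
QueryGen = Callable[[str, List[str]], str]   # (question, evidence so far) -> next query
Retriever = Callable[[str], str]             # query -> retrieved passage
AnswerGen = Callable[[str, List[str]], str]  # (question, evidence) -> final answer


def run_rar(question: str, make_query: QueryGen, retrieve: Retriever,
            make_answer: AnswerGen, n_steps: int = 2,
            action: Optional[Action] = None, perturb_step: int = 0) -> str:
    """Run a multi-step retrieve-and-reason loop, optionally perturbing one step."""
    evidence: List[str] = []
    for step in range(n_steps):
        query = make_query(question, evidence)
        if action is not None and step == perturb_step:
            query = action(query)          # the perturbation alters the retriever's input
        evidence.append(retrieve(query))   # which shifts what later steps get to see
    return make_answer(question, evidence)


def consistency_confidence(question: str, make_query: QueryGen, retrieve: Retriever,
                           make_answer: AnswerGen, actions: List[Action],
                           n_steps: int = 2, n_samples: int = 8,
                           seed: int = 0) -> Tuple[str, float]:
    """Return the unperturbed answer and the share of perturbed runs that agree with it.

    Assumption: exact-match agreement stands in for the consistency score.
    """
    rng = random.Random(seed)
    base = run_rar(question, make_query, retrieve, make_answer, n_steps)
    agree = 0
    for _ in range(n_samples):
        action = rng.choice(actions)
        step = rng.randrange(n_steps)
        answer = run_rar(question, make_query, retrieve, make_answer,
                         n_steps, action, step)
        agree += int(answer == base)
    return base, agree / n_samples


# Toy two-hop setup (entirely made up): find a capital, then a river in it.
TOY_INDEX = {
    "capital of france": "paris",
    "capital of germany": "berlin",
    "river in paris": "seine",
    "river in berlin": "spree",
}
QUESTION = "Which river flows through the capital of France?"


def toy_make_query(question: str, evidence: List[str]) -> str:
    # A stand-in generator: the second query depends on the first retrieval.
    return "capital of france" if not evidence else f"river in {evidence[-1]}"


def toy_retrieve(query: str) -> str:
    return TOY_INDEX.get(query, "unknown")


def toy_make_answer(question: str, evidence: List[str]) -> str:
    return evidence[-1]


def identity(query: str) -> str:
    return query


def swap_country(query: str) -> str:
    return query.replace("france", "germany")


def drop_query(query: str) -> str:
    return ""


if __name__ == "__main__":
    print(consistency_confidence(QUESTION, toy_make_query, toy_retrieve,
                                 toy_make_answer,
                                 [identity, swap_country, drop_query]))
```

In the toy setup, perturbing the first query changes the first retrieved passage, which changes the second query and therefore the final answer. This is the propagation through both components that the abstract describes. A real system would replace the toy functions with an actual retriever and language model, and would likely need a softer notion of answer agreement than exact match.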
