Large language models (LLMs) can improve factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is already reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be overly conservative, because the final utility depends on jointly thresholding the uncertainty of the LLM-only and RAG branches. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG treats each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration that controls the system-level error rate among accepted examples while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
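To make the notion of an operating point concrete, the following is a minimal Python sketch of the cascade decision rule and of the empirical quantities attached to one threshold pair. It is an illustration under stated assumptions, not the BalanceRAG procedure: it only computes empirical coverage, selective risk, and retrieval rate on toy data, and it does not perform the sequential graphical testing that certifies a pair. The names `route`, `evaluate`, `OperatingPoint`, the toy scores, the convention that a lower score means higher confidence, and the choice to count a retrieval call whenever the LLM-only branch is rejected are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Sequence


@dataclass
class OperatingPoint:
    coverage: float        # fraction of queries that receive an answer
    selective_risk: float  # error rate among answered queries
    retrieval_rate: float  # fraction of queries escalated to the RAG branch


def route(u_llm: float, u_rag: float, t_llm: float, t_rag: float) -> str:
    """Cascade rule (assumed): lower uncertainty score means more confident."""
    if u_llm <= t_llm:
        return "llm"      # accept the LLM-only answer
    if u_rag <= t_rag:
        return "rag"      # escalate and accept the RAG answer
    return "abstain"      # neither branch is trusted


def evaluate(
    u_llm: Sequence[float],
    u_rag: Sequence[float],
    llm_correct: Sequence[bool],
    rag_correct: Sequence[bool],
    t_llm: float,
    t_rag: float,
) -> OperatingPoint:
    """Empirical metrics of one threshold pair on a labeled sample."""
    n = len(u_llm)
    accepted = errors = retrievals = 0
    for ul, ur, cl, cr in zip(u_llm, u_rag, llm_correct, rag_correct):
        decision = route(ul, ur, t_llm, t_rag)
        if decision != "llm":
            retrievals += 1  # assumed: RAG is invoked even if it then abstains
        if decision == "abstain":
            continue
        accepted += 1
        correct = cl if decision == "llm" else cr
        errors += 0 if correct else 1
    risk = errors / accepted if accepted else 0.0
    return OperatingPoint(accepted / n, risk, retrievals / n)


# Toy calibration data (hypothetical values, for illustration only).
u_llm = [0.1, 0.2, 0.6, 0.7, 0.9]
u_rag = [0.3, 0.5, 0.2, 0.8, 0.4]
llm_correct = [True, True, False, False, False]
rag_correct = [True, True, True, False, False]

# A small two-dimensional lattice of candidate threshold pairs.
grid = [0.3, 0.5, 0.8]
lattice = {
    (t1, t2): evaluate(u_llm, u_rag, llm_correct, rag_correct, t1, t2)
    for t1, t2 in product(grid, grid)
}

for (t1, t2), point in lattice.items():
    print(f"t_llm={t1:.1f} t_rag={t2:.1f} -> {point}")
```

On this toy sample, the pair (0.3, 0.5) answers four of five queries with one error and escalates three queries, which shows how a single threshold pair trades coverage against selective risk and retrieval usage. A certification method would then decide which lattice points can be declared safe at the target risk level with statistical guarantees, rather than reading off empirical rates as done here.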