Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks and find that LLMs are poorly calibrated, especially when the retrieved contexts are noisy. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules), which provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. Through supervised fine-tuning (SFT) on this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving expected calibration error (ECE) by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for LLMs that are both accurate and epistemically reliable.
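For readers unfamiliar with the ECE metric reported above, the following is a minimal sketch of how expected calibration error is commonly computed from verbalized confidences. It assumes equal-width confidence bins and confidences already mapped to [0, 1]. The function name `expected_calibration_error`, the `n_bins` default, and the binning scheme are illustrative assumptions, not details taken from the paper, and the sketch does not say whether the reported gains are absolute or relative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: stated confidences in [0, 1] (assumed already parsed from model output).
    correct: booleans, True where the corresponding answer was right.
    n_bins: number of equal-width bins (an assumed default, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, is_correct in members if is_correct) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident toy case: 90% stated confidence, 50% actual accuracy.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

In the toy case, the gap between stated confidence and observed accuracy is the kind of overconfidence the abstract describes; a lower ECE indicates that stated confidence tracks accuracy more closely.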