Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks and find that LLMs are poorly calibrated when retrieved contexts are noisy: contradictory or irrelevant evidence tends to inflate the model's unwarranted certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules), a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. Supervised fine-tuning (SFT) on this data equips models with intrinsic noise awareness without relying on stronger teacher models. Empirically, NAACL yields substantial gains, improving ECE by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for LLMs that are both accurate and epistemically reliable.
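For readers unfamiliar with the reported metric: Expected Calibration Error (ECE) measures the gap between a model's stated confidence and its empirical accuracy, computed over confidence bins. Below is a minimal sketch of the standard equal-width binned ECE computation; the bin count and the use of half-open bins are common conventions and illustrative assumptions, not details taken from the paper's evaluation setup.

```python
def expected_calibration_error(confidences, correctness, n_bins=10):
    """Binned ECE: frequency-weighted average of |avg confidence - accuracy|
    over equal-width confidence bins (lo, hi] on [0, 1]."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Indices of predictions whose confidence falls in this bin.
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correctness[i] for i in idx) / len(idx)
        # Weight each bin's calibration gap by its share of predictions.
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece
```

Under this definition, an overconfident model (e.g., 90% stated confidence but only 50% accuracy) receives a high ECE, while a perfectly calibrated one scores zero; this is the sense in which noisy retrieved contexts that inflate certainty degrade calibration.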