Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently hallucinate and give incorrect responses with high certainty, which hinders their use in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score against binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from a perceptual failure or from a reasoning error despite correct perception, and a single confidence score conflates these two sources, while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual confidence and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimate that combines (i) visual grounding, measured by the KL divergence under image perturbations, and (ii) internal certainty, measured by token entropy. We further propose token-level advantage reweighting, which focuses optimization on tokens according to their visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration improves calibration while also boosting visual reasoning accuracy, and that it generalizes to out-of-distribution benchmarks across model scales and architectures.
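To make the two ingredients of the intrinsic visual certainty estimate concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the KL term and the entropy term are normalized or combined, so the function names, the `1 - exp(-KL)` squashing, the entropy normalization by `log(vocab)`, and the product used to combine them are all assumptions made for illustration. Inputs are per-token logits from the same model, once with the original image and once with a perturbed image.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kl(p, q, eps=1e-12):
    """Per-token KL(p || q) between two next-token distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def token_entropy(p, eps=1e-12):
    """Per-token Shannon entropy of a next-token distribution."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def visual_certainty(logits_clean, logits_perturbed):
    """Hypothetical per-token visual certainty score in [0, 1].

    logits_clean:     (num_tokens, vocab) logits given the original image
    logits_perturbed: (num_tokens, vocab) logits given a perturbed image
    """
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)

    # (i) Visual grounding: a large KL means the token depends on the image.
    # Assumed squashing to [0, 1); the source does not state a normalization.
    grounding = 1.0 - np.exp(-token_kl(p, q))

    # (ii) Internal certainty: low entropy means the model is sure of the token.
    # Assumed normalization by the maximum entropy log(vocab).
    certainty = 1.0 - token_entropy(p) / np.log(p.shape[-1])

    # Assumed combination rule: a token scores high only if it is both
    # image-dependent and confidently predicted.
    return grounding * certainty

# Toy example with a 4-word vocabulary and two tokens.
logits_clean = np.array([[10.0, 0.0, 0.0, 0.0],   # confident token
                         [0.0, 0.0, 0.0, 0.0]])   # uniform token
logits_perturbed = np.array([[0.0, 10.0, 0.0, 0.0],  # prediction flips
                             [0.0, 0.0, 0.0, 0.0]])  # unchanged
scores = visual_certainty(logits_clean, logits_perturbed)
```

In this toy example, the first token is confidently predicted and its distribution changes sharply when the image is perturbed, so it receives a high score; the second token is unaffected by the perturbation, so its grounding term, and therefore its score, is zero. Scores of this kind could in principle serve as the per-token weights in the advantage reweighting step, although the abstract does not give the exact weighting function.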