Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses pairs of original and noise-corrupted images together with a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond its training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules under the guidance of confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results, with consistent gains of 8.8% across four benchmarks. Further ablation studies demonstrate the effectiveness of each module and the superiority of our scaling strategy.