With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare and autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving discriminative performance. Recent state-of-the-art methods advocate enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the gradients induced by fully orthogonal constraints strongly push semantically related classes apart, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Through comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance while maintaining competitive discriminative capabilities.
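The abstract does not give SoC's exact formulation, but the core idea — a Huber penalty on pairwise prototype similarities, whose gradient saturates instead of growing so that semantically close classes are not forced apart — can be illustrated with a minimal sketch. All names here (`huber`, `soc_regularizer`, the `delta` threshold) are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

def huber(x, delta=0.1):
    # Hypothetical sketch: quadratic near zero, linear beyond |x| = delta.
    # The linear branch has a constant-magnitude gradient, so large
    # similarities between related classes are penalized gently rather
    # than being pushed hard toward full orthogonality.
    ax = np.abs(x)
    return np.where(ax <= delta, 0.5 * x**2, delta * (ax - 0.5 * delta))

def soc_regularizer(prototypes, delta=0.1):
    # prototypes: (C, d) array of class text-prompt embeddings.
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = P @ P.T                                 # pairwise cosine similarities
    off_diag = sim[~np.eye(len(P), dtype=bool)]   # exclude self-similarities
    return huber(off_diag, delta).mean()          # smooth separation penalty
```

Under this sketch, fully orthogonal prototypes incur zero penalty, while overlapping ones are separated with bounded gradients; a full-orthogonality loss would instead apply unbounded pressure to every nonzero similarity.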