In deep learning, test-time adaptation has gained attention as a method for fine-tuning models without labeled data. A prime example is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have mainly been developed to improve accuracy, overlooking calibration, a crucial aspect of quantifying prediction uncertainty. Meanwhile, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the choice of prompt significantly affects calibration in CLIP: prompts yielding higher text feature dispersion produce better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts at test time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
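To make the dispersion idea concrete, here is a minimal sketch of how an Average Text Feature Dispersion score could be computed, assuming (as the name suggests) that ATFD measures how far each class's prompt-conditioned text embedding lies from the centroid of all class embeddings; the function name, the use of L2-normalized features, and the random stand-ins for CLIP text embeddings are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def atfd(text_features: np.ndarray) -> float:
    """Hypothetical ATFD: mean L2 distance of each class's
    (L2-normalized) text embedding from the embeddings' centroid.
    text_features has shape (num_classes, embed_dim)."""
    feats = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    centroid = feats.mean(axis=0)
    return float(np.linalg.norm(feats - centroid, axis=1).mean())

# Stand-ins for CLIP text embeddings of 10 classes (dim 512).
rng = np.random.default_rng(0)
anchor = rng.normal(size=512)
tight = anchor + 0.01 * rng.normal(size=(10, 512))   # near-identical class features
spread = rng.normal(size=(10, 512))                  # well-separated class features

# A prompt producing more dispersed class features scores higher.
assert atfd(tight) < atfd(spread)
```

Under this reading, a prompt tuner could add a term rewarding higher ATFD alongside the usual test-time objective, which is the intuition the abstract attributes to C-TPT.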