In deep learning, test-time adaptation has gained attention as a way to fine-tune models without labeled data. A prime example is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. However, these prompts have mainly been developed to improve accuracy, overlooking calibration, a crucial aspect of quantifying prediction uncertainty. Moreover, traditional calibration methods rely on substantial amounts of labeled data, making them impractical in test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the choice of prompt significantly affects calibration in CLIP: prompts that lead to higher text feature dispersion yield better-calibrated predictions. We introduce the Average Text Feature Dispersion (ATFD), establish its relationship with calibration error, and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), which optimizes prompts at test time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT effectively improves the calibration of test-time prompt tuning without needing labeled data.
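
To make the notion of text feature dispersion concrete, the following is a minimal sketch of how a quantity like ATFD could be computed. The abstract does not give the formula, so the definition used here (the mean L2 distance of each class's text feature from the centroid of all class text features) is an assumption, and the function name `average_text_feature_dispersion` is hypothetical. The inputs stand in for the text embeddings a CLIP text encoder would produce for one prompt across all class names.

```python
import math
from typing import List, Sequence


def average_text_feature_dispersion(text_features: Sequence[Sequence[float]]) -> float:
    """Mean L2 distance of each class text feature from their centroid.

    Assumed definition for illustration; each row is the text embedding
    of one class name under a given prompt.
    """
    if not text_features:
        raise ValueError("text_features must contain at least one vector")

    n = len(text_features)
    dim = len(text_features[0])
    if any(len(f) != dim for f in text_features):
        raise ValueError("all text features must have the same dimension")

    # Centroid of the class text features, computed per dimension.
    centroid: List[float] = [
        sum(f[d] for f in text_features) / n for d in range(dim)
    ]

    # Average Euclidean distance from each feature to the centroid.
    return sum(math.dist(f, centroid) for f in text_features) / n


# Toy 2-D "embeddings": identical features versus orthogonal features.
clustered = [[1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0]]

print(average_text_feature_dispersion(clustered))  # no dispersion
print(average_text_feature_dispersion(spread))     # larger dispersion
```

Under this assumed definition, a prompt whose class embeddings collapse to one point scores zero, while a prompt that spreads them apart scores higher. This is the direction the abstract associates with better-calibrated predictions.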