Confidence calibration is an emerging challenge in real-world decision systems built on foundation models for downstream vision classification tasks. For reasons examined in this work, logit scores from the CLIP head remain large regardless of whether the image-language pair actually matches, and the few-shot regime leaves little room to address this in data space. We propose a penalty, incorporated into the fine-tuning loss objective, that penalizes each incorrect classification by transferring an amount of log-likelihood to the true class proportional to the relative magnitudes of the two likelihoods. We refer to this as the \textit{confidence misalignment penalty (CMP)}. Extensive experiments on $12$ vision datasets and $5$ domain generalization datasets support the calibration performance of our method against the state-of-the-art. CMP outperforms the benchmarked prompt learning methods, improving Expected Calibration Error (ECE) by $6.01$\% on average, $4.01$\% at minimum, and $9.72$\% at maximum.
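The penalty described above can be sketched numerically. The snippet below is a minimal, hypothetical reading of the abstract, not the paper's exact formulation: when the top prediction is wrong, it adds to the cross-entropy a term scaled by the relative magnitude of the winning (incorrect) likelihood versus the true-class likelihood, so log-likelihood mass is pushed toward the true class in proportion to how badly the two disagree. The function name `cmp_loss` and the specific ratio form are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D logit vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cmp_loss(logits, true_idx):
    """Cross-entropy plus a confidence-misalignment penalty.

    Hypothetical sketch: the penalty fires only on a misclassification,
    and grows with the ratio of the (wrong) winning class likelihood to
    the combined winning + true-class likelihood.
    """
    p = softmax(np.asarray(logits, dtype=float))
    ce = -np.log(p[true_idx])          # standard cross-entropy term
    pred = int(np.argmax(p))
    if pred == true_idx:
        return ce                      # correct prediction: no penalty
    # misalignment ratio in (0, 1): how much the wrong class dominates
    ratio = p[pred] / (p[pred] + p[true_idx])
    return ce + ratio * ce             # penalty commensurate with the ratio
```

For a confidently wrong prediction the ratio approaches $1$ and the penalty nearly doubles the loss, discouraging the large, indiscriminate logits the abstract attributes to the CLIP head.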