The Contrastive Language-Image Pre-Training (CLIP) model excels at few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), the resemblance between a text template and an image sample, introduces bias: the model comes to rely on template proximity rather than true sample-to-category alignment, which reduces both classification accuracy and robustness. We present a framework built on empty prompts, textual inputs that convey the idea of "emptiness" without carrying any category information. These prompts capture unbiased template features and offset TSS bias. The framework operates in two stages. During pre-training, empty prompts expose and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces the performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The project repository is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
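
To make the empty-prompt idea concrete, below is a minimal sketch (not the authors' released implementation) using OpenAI's `clip` package. It encodes an assumed empty prompt ("a photo of nothing."), subtracts that template-only direction from the class-prompt features, and scores images against the debiased features. The prompt wording, the subtraction form, the scale `alpha`, and the cross-entropy form of the calibration loss are all illustrative assumptions.

```python
# Hedged sketch of empty-prompt debiasing; an illustration under stated
# assumptions, not the paper's actual method.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

template = "a photo of a {}."
classes = ["dog", "cat", "car"]

with torch.no_grad():
    # Class prompts: the template filled with category names.
    class_tokens = clip.tokenize([template.format(c) for c in classes]).to(device)
    class_feats = F.normalize(model.encode_text(class_tokens).float(), dim=-1)

    # Empty prompt: conveys "emptiness" with no category information
    # (the exact wording is an assumption).
    empty_tokens = clip.tokenize(["a photo of nothing."]).to(device)
    empty_feat = F.normalize(model.encode_text(empty_tokens).float(), dim=-1)

# Debias: remove the template-only direction from each class feature,
# then renormalize. alpha is a hypothetical calibration strength.
alpha = 0.5
debiased = F.normalize(class_feats - alpha * empty_feat, dim=-1)

def bias_calibration_loss(image_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Assumed form of the calibration loss: cross-entropy over debiased
    image-text similarities, pushing each image toward its own category."""
    image_feats = F.normalize(image_feats, dim=-1)
    logits = 100.0 * image_feats @ debiased.T  # 100 approximates CLIP's logit scale
    return F.cross_entropy(logits, labels)
```

In this sketch, the few-shot fine-tuning stage would minimize `bias_calibration_loss` over the support images so that classification depends on the debiased, category-specific directions rather than on raw template proximity.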

