Vision-language models (VLMs) such as CLIP have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labeled data is unavailable. Recent research has proposed pseudo-labeling approaches that adapt CLIP using unlabeled target data. Nonetheless, these methods struggle with noisy pseudo-labels arising from the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA constructs accurate pseudo-labels through dual prototypes that act as distinct classifiers, combining their outputs via a convex combination. It then ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it mitigates visual-textual misalignment by aligning textual prototypes with image prototypes, further improving adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms both zero-shot CLIP and state-of-the-art unsupervised adaptation baselines.
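To make the dual-prototype pseudo-labeling step concrete, below is a minimal PyTorch sketch, not the authors' implementation. It treats each prototype set as a cosine-similarity classifier and mixes the two classifiers' probabilities with a convex combination; the names `text_protos`, `image_protos`, and the defaults `alpha` and `tau` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def dual_prototype_pseudo_labels(features, text_protos, image_protos,
                                 alpha=0.5, tau=0.01):
    """Sketch of pseudo-label construction from dual prototype classifiers.

    features:     (N, D) image embeddings from CLIP's visual encoder
    text_protos:  (C, D) textual prototypes (e.g., class-prompt embeddings)
    image_protos: (C, D) image prototypes (e.g., per-class feature means)
    alpha:        convex-combination weight between the two classifiers
    tau:          softmax temperature
    All names and defaults are assumptions for illustration.
    """
    # L2-normalize so that dot products are cosine similarities.
    features = F.normalize(features, dim=-1)
    text_protos = F.normalize(text_protos, dim=-1)
    image_protos = F.normalize(image_protos, dim=-1)

    # Each prototype set acts as a distinct classifier.
    logits_text = features @ text_protos.t() / tau
    logits_image = features @ image_protos.t() / tau

    # Convex combination of the two classifiers' output distributions.
    probs = alpha * logits_text.softmax(dim=-1) \
        + (1.0 - alpha) * logits_image.softmax(dim=-1)

    # Pseudo-labels plus a confidence score per sample.
    conf, labels = probs.max(dim=-1)
    return labels, conf
```

Under this reading, the returned `conf` scores could serve as the ranking signal the abstract mentions: self-training would start from the highest-ranked pseudo-labels, which helps keep early training robust to label noise.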