Vision-language models (VLMs) such as CLIP have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labelled data is unavailable. Recent work has proposed pseudo-labelling approaches that adapt CLIP using only unlabelled target data. Nonetheless, these methods struggle with noisy pseudo-labels caused by the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA constructs dual prototypes that act as distinct classifiers and takes a convex combination of their outputs, yielding more accurate pseudo-labels. It then ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it mitigates visual-textual misalignment by aligning textual prototypes with image prototypes, further improving adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms both zero-shot CLIP and state-of-the-art unsupervised adaptation baselines.
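To make the described pipeline concrete, below is a minimal PyTorch sketch of the steps the abstract names: pseudo-labelling via a convex combination of two prototype classifiers, confidence-based ranking, and a prototype-alignment term. This is an illustrative reading, not the paper's exact formulation; `alpha`, `tau`, and `keep_ratio` are hypothetical hyperparameters, and ranking by the combined classifier's confidence is an assumption.

```python
import torch
import torch.nn.functional as F

def dual_prototype_pseudo_labels(image_feats, text_protos, image_protos,
                                 alpha=0.5, tau=0.01, keep_ratio=0.6):
    """Pseudo-label a batch by convexly combining two prototype classifiers,
    then rank by confidence and keep only the most reliable fraction.

    image_feats: (N, D) CLIP image embeddings of unlabelled target samples.
    text_protos / image_protos: (C, D) textual and visual class prototypes.
    """
    z = F.normalize(image_feats, dim=-1)
    p_txt = F.normalize(text_protos, dim=-1)
    p_img = F.normalize(image_protos, dim=-1)

    # Each prototype bank acts as a distinct cosine-similarity classifier.
    probs_txt = (z @ p_txt.t() / tau).softmax(dim=-1)
    probs_img = (z @ p_img.t() / tau).softmax(dim=-1)

    # Convex combination of the two classifiers' outputs.
    probs = alpha * probs_txt + (1.0 - alpha) * probs_img
    conf, labels = probs.max(dim=-1)

    # Rank pseudo-labels by confidence; self-train only on the top fraction
    # (helps especially early in training, when pseudo-labels are noisiest).
    k = max(1, int(keep_ratio * conf.numel()))
    keep = conf.argsort(descending=True)[:k]
    return labels[keep], keep

def prototype_alignment_loss(text_protos, image_protos):
    """Pull each textual prototype toward its image counterpart: one
    plausible instantiation of the visual-textual alignment step."""
    p_txt = F.normalize(text_protos, dim=-1)
    p_img = F.normalize(image_protos, dim=-1)
    return (1.0 - (p_txt * p_img).sum(dim=-1)).mean()
```

In this sketch the image prototypes would typically be initialised from CLIP's textual prototypes and updated from confidently pseudo-labelled target features, while the alignment loss is added to the self-training objective; the exact update rules and loss weighting are design choices not specified by the abstract.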