Source-Free Domain Adaptation (SFDA) aims to adapt a source model to a target domain, with access only to unlabeled target training data and the source model pre-trained with supervision on the source domain. Because they rely on pseudo-labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we explore, for the first time, the potential of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP), which carry rich yet heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, because the model is largely generic rather than specialized for this particular task. To make it task-specific, we propose a novel Distilling multimodal Foundation model (DIFO) approach. Specifically, DIFO alternates between two steps during adaptation: (i) customizing the ViL model by maximizing its mutual information with the target model in a prompt-learning manner, and (ii) distilling the knowledge of this customized ViL model into the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms state-of-the-art alternatives. Code is available.
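To make the two alternating steps more concrete, the following is a minimal sketch of the two quantities they revolve around: a mutual-information score between ViL and target-model predictions (step i) and a distillation loss from the ViL model to the target model (step ii). The abstract does not specify the estimators, so everything here is an assumption: mutual information is estimated from the batch-averaged joint of paired soft predictions, distillation is a mean KL divergence, and the function names `mutual_information` and `distillation_loss` are hypothetical. The two regularization terms are not modeled.

```python
import math


def joint_distribution(p_batch, q_batch):
    """Estimate a K x K joint by averaging outer products of paired predictions."""
    n = len(p_batch)
    k = len(p_batch[0])
    joint = [[0.0] * k for _ in range(k)]
    for p, q in zip(p_batch, q_batch):
        for a in range(k):
            for b in range(k):
                joint[a][b] += p[a] * q[b] / n
    return joint


def mutual_information(p_batch, q_batch, eps=1e-12):
    """Mutual information of the estimated joint (assumed estimator).

    In step (i) this score would be maximized with respect to the
    ViL model's learnable prompts.
    """
    joint = joint_distribution(p_batch, q_batch)
    k = len(joint)
    row = [sum(joint[a]) for a in range(k)]
    col = [sum(joint[a][b] for a in range(k)) for b in range(k)]
    mi = 0.0
    for a in range(k):
        for b in range(k):
            if joint[a][b] > eps:
                mi += joint[a][b] * math.log(joint[a][b] / (row[a] * col[b]))
    return mi


def distillation_loss(teacher_batch, student_batch, eps=1e-12):
    """Mean KL(teacher || student) over the batch (assumed distillation loss).

    In step (ii) this would be minimized with respect to the target model.
    """
    total = 0.0
    for t, s in zip(teacher_batch, student_batch):
        total += sum(
            ti * math.log((ti + eps) / (si + eps))
            for ti, si in zip(t, s)
            if ti > 0
        )
    return total / len(teacher_batch)


# Toy two-class predictions for two target images (made-up numbers).
vil_preds = [[0.9, 0.1], [0.2, 0.8]]      # customized ViL model
target_preds = [[0.8, 0.2], [0.3, 0.7]]   # target model

mi_score = mutual_information(vil_preds, target_preds)
kd_loss = distillation_loss(vil_preds, target_preds)
```

In an actual training loop, one would alternate between a gradient step that increases `mi_score` through the prompt parameters and a step that decreases `kd_loss` through the target model's weights; an autodiff framework would replace these plain-Python loops.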