Large vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot performance on unsupervised domain adaptation tasks. Yet most transfer approaches for VLMs focus on either the language or the vision branch, overlooking the nuanced interplay between the two modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a lightweight modality separation network that disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while preserving modality-specific nuances, and a modality discriminator aligns features across domains. Comprehensive evaluations on three benchmarks show that our approach sets a new state of the art at minimal computational cost. Code: https://github.com/TL-UESTC/UniMoS