Although deep learning models perform impressively on supervised learning tasks, they often generalize poorly when the training (source) and test (target) domains differ. Unsupervised domain adaptation (DA) has emerged as a popular solution to this problem. However, current DA techniques rely on visual backbones, whose features may lack semantic richness. Despite the potential of large-scale vision-language foundation models such as CLIP, their effectiveness for DA has yet to be fully explored. To address this gap, we introduce AD-CLIP, a domain-agnostic prompt learning strategy for CLIP that aims to solve the DA problem in the prompt space. We leverage the frozen vision backbone of CLIP to extract both image style (domain) and content information, which we use to learn prompt tokens. Conditioning prompt learning on image style and content features simultaneously makes our prompts domain-invariant and class-generalizable by design. We use standard supervised contrastive learning in the source domain and propose an entropy minimization strategy that aligns the domains in the embedding space using the target domain data. We also consider a scenario where only target domain samples are available during testing, without any source domain data, and propose a cross-domain style mapping network to hallucinate domain-agnostic tokens. Extensive experiments on three benchmark DA datasets demonstrate the effectiveness of AD-CLIP compared with existing methods.
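The two training signals named in the abstract, a supervised term on labeled source data and an entropy term on unlabeled target data, can be illustrated with a minimal NumPy sketch. It is not the AD-CLIP implementation. The function names, the temperature of 0.01, and the use of cross-entropy over image-text cosine similarities as a stand-in for the supervised contrastive loss are all assumptions made for illustration. The embeddings are taken as given, so the prompt learner and the CLIP encoders are outside its scope.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def class_probabilities(image_emb, text_emb, temperature=0.01):
    """Softmax over image-to-class-prompt cosine similarities.

    image_emb: (N, D) image embeddings; text_emb: (K, D) one embedding per
    class prompt. The temperature value is an assumption of this sketch.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def source_cross_entropy(probs, labels):
    """Supervised term on labeled source samples (assumed stand-in loss)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


def target_entropy(probs):
    """Mean Shannon entropy of predictions on unlabeled target samples.

    Minimizing this pushes target predictions toward confident classes.
    """
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())
```

In a training loop one would typically minimize a weighted sum of the two terms. The weighting is not specified in the abstract and would be a further hyperparameter.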