Conventional unsupervised domain adaptation (UDA) minimizes the distribution discrepancy between domains, but it neglects the rich semantics in the data and struggles with complex domain shifts. A promising alternative is to leverage the knowledge of large-scale pre-trained vision-language models for more guided adaptation. Existing methods of this kind typically learn separate textual prompts that embed domain semantics for the source and target domains and classify within each domain, which limits cross-domain knowledge transfer. Moreover, prompting only the language branch lacks the flexibility to adapt both modalities dynamically. To bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP), which exploits domain-invariant semantics by mutually aligning visual and textual embeddings. Specifically, image contextual information is used to prompt the language branch in a domain-agnostic and instance-conditioned way. Meanwhile, visual prompts are imposed based on the domain-agnostic textual prompt to elicit domain-invariant visual embeddings. The prompts of the two branches are learned mutually with a cross-attention module and regularized with a semantic-consistency loss and an instance-discrimination contrastive loss. Experiments on three UDA benchmarks show that DAMP outperforms state-of-the-art approaches.
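The abstract names an instance-discrimination contrastive loss but does not give its formulation. As a rough illustration of the general idea only, the sketch below implements a generic symmetric InfoNCE loss over paired image and text embeddings. The function names (`l2_normalize`, `log_softmax_at`, `info_nce`), the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for this example; this is not DAMP's actual loss or code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits` (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return logits[index] - log_sum


def info_nce(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE over a batch of paired embeddings (illustrative only).

    Row i of `image_embs` and row i of `text_embs` form the positive pair;
    every other row in the batch acts as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in images
    ]
    # Image-to-text direction: each image should pick out its own text.
    image_to_text = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image direction: each text should pick out its own image.
    text_to_image = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: matching pairs point in nearly the same direction.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# Same embeddings with the text rows swapped, so every pair is mismatched.
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch, `aligned` comes out far smaller than `shuffled`, which is the behavior such a loss is meant to reward: each instance's visual embedding is pulled toward its own textual embedding and pushed away from those of other instances.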