Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.
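The two training signals described above, text-feature guidance and representation alignment between original and stylized images, can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything here is an assumption made for illustration: the function names are hypothetical, cosine distance stands in for whatever alignment measure the method actually uses, and taking the maximum over augmented views is only one plausible reading of "worst" in WERA.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_guidance_loss(visual_feat, text_feat):
    """Hypothetical guidance term: cosine distance between a visual feature
    and the disentangled text feature of the same class (assumed form)."""
    return 1.0 - cosine_similarity(visual_feat, text_feat)


def worst_case_alignment_loss(original_feat, augmented_feats):
    """Hypothetical alignment term: the largest cosine distance between the
    feature of the original image and the feature of any stylized view.
    Penalizing the maximum focuses on the hardest augmentation (assumption)."""
    return max(1.0 - cosine_similarity(original_feat, f) for f in augmented_feats)


# Toy features: one stylized view matches the original, one is orthogonal to it.
original = [1.0, 0.0]
stylized_views = [[1.0, 0.0], [0.0, 1.0]]
text_feature = [2.0, 0.0]  # same direction as the visual feature

guidance = text_guidance_loss(original, text_feature)
alignment = worst_case_alignment_loss(original, stylized_views)
total = guidance + alignment  # equal weighting is an arbitrary choice here
```

In this toy case the guidance term vanishes because the visual and text features point in the same direction, while the alignment term is driven entirely by the orthogonal stylized view. In practice the features would come from the VFM's image and text encoders rather than hand-written vectors.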
