Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where the label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual-space alignment but often struggled with visual ambiguities caused by content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong \textit{zero-shot capabilities} of recent vision-language foundation models (VLMs) such as CLIP, concentrating solely on label-space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers from label names alone. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels -- such as labels similar to source labels (e.g., synonyms, hypernyms, and hyponyms) -- complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA (\ours). Our method aligns label spaces instead of visual spaces by filtering and refining the noisy labels between domains. We then construct a \textit{universal classifier} that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. Experimental results show that the proposed method considerably outperforms existing UniDA techniques on key DomainBed benchmarks, delivering average improvements of \textcolor{blue}{+7.9\%} in H-score and \textcolor{blue}{+6.1\%} in H$^3$-score. Furthermore, incorporating self-training further enhances performance, yielding an additional \textcolor{blue}{+1.6\%} gain in both H- and H$^3$-scores.
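As a brief illustration of the zero-shot classification that our label-space alignment builds on (this is the standard CLIP formulation, not our full pipeline; the prompt template, temperature $\tau$, and symbols $f_{\mathrm{img}}$, $f_{\mathrm{txt}}$, $t_y$, $\mathcal{Y}$ are introduced here only for exposition), a classifier over a candidate label set $\mathcal{Y}$ can be formed from label names alone as
\[
p(y \mid x) \;=\; \frac{\exp\!\big(\cos\big(f_{\mathrm{img}}(x),\, f_{\mathrm{txt}}(t_y)\big)/\tau\big)}{\sum_{y' \in \mathcal{Y}} \exp\!\big(\cos\big(f_{\mathrm{img}}(x),\, f_{\mathrm{txt}}(t_{y'})\big)/\tau\big)},
\qquad t_y = \text{``a photo of a [label$_y$]''},
\]
where $f_{\mathrm{img}}$ and $f_{\mathrm{txt}}$ denote CLIP's image and text encoders and $\cos(\cdot,\cdot)$ is cosine similarity. UniDA is challenging under this formulation precisely because the target-domain label set $\mathcal{Y}$ is not fully known in advance, which is what our label discovery and filtering steps address.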