Universal domain adaptation (UniDA) aims to transfer knowledge from a source domain to a target domain without any prior knowledge of the label set. The central challenge is determining whether target samples belong to the common categories. Mainstream methods make this judgment from sample features, which overemphasizes global information while ignoring the most crucial local objects in the image, resulting in limited accuracy. To address this issue, we propose the Universal Attention Matching (UniAM) framework, which exploits the self-attention mechanism of the vision transformer to capture crucial object information. The framework introduces a novel Compressive Attention Matching (CAM) approach that explores the core information by compressively representing attentions. Furthermore, CAM incorporates a residual-based measurement to determine sample commonness. Using this measurement, UniAM achieves domain-wise and category-wise Common Feature Alignment (CFA) and Target Class Separation (TCS). Notably, UniAM is the first method to use the attention in the vision transformer directly to perform classification tasks. Extensive experiments show that UniAM outperforms current state-of-the-art methods on various benchmark datasets.
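The idea of a residual-based commonness measurement can be illustrated with a small sketch. The abstract does not specify how CAM compresses attentions or computes its residual, so everything below is an assumption made for illustration: the function name `residual_commonness`, the use of flattened attention vectors, and the ridge-regularised least-squares fit (swapped in here for whatever compressive representation the method actually uses). The intuition shown is only that a target attention vector which is well reconstructed from source attention vectors leaves a small residual and is more likely to belong to a common category.

```python
import numpy as np


def residual_commonness(source_attn, target_attn, lam=1e-3):
    """Relative reconstruction residual of a target attention vector.

    Hypothetical helper, not the authors' implementation.
    source_attn: array of shape (d, k); each column is one source attention vector.
    target_attn: array of shape (d,); the target sample's attention vector.
    lam: ridge regularisation strength (an assumed hyperparameter).
    Returns a non-negative float; smaller means "better explained by the source".
    """
    D = np.asarray(source_attn, dtype=float)
    a = np.asarray(target_attn, dtype=float)
    k = D.shape[1]

    # Ridge-regularised least squares: argmin_c ||a - D c||^2 + lam * ||c||^2
    coeffs = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ a)

    # Residual of the reconstruction, normalised by the norm of the target vector
    residual = np.linalg.norm(a - D @ coeffs)
    return residual / (np.linalg.norm(a) + 1e-12)


# Toy example: two source attention vectors focused on positions 0 and 1
source = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0],
                   [0.0, 0.0]])
common_like = np.array([0.5, 0.5, 0.0, 0.0])   # attends where the source attends
private_like = np.array([0.0, 0.0, 1.0, 0.0])  # attends somewhere the source never does

r_common = residual_commonness(source, common_like)
r_private = residual_commonness(source, private_like)
```

In this toy setting `r_common` is much smaller than `r_private`, so thresholding the residual would separate the two samples. How UniAM actually sets such a threshold, and how the measurement feeds into CFA and TCS, is not described in the abstract.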