Universal domain adaptation (UniDA) aims to transfer knowledge from a source domain to a target domain without any prior knowledge about the label set. The challenge lies in determining whether target samples belong to the common categories. Mainstream methods make this judgment based on sample features, which overemphasizes global information while ignoring the most crucial local objects in the image, resulting in limited accuracy. To address this issue, we propose the Universal Attention Matching (UniAM) framework, which exploits the self-attention mechanism in the vision transformer to capture crucial object information. The proposed framework introduces a novel Compressive Attention Matching (CAM) approach that explores the core information by compressively representing attentions. Furthermore, CAM incorporates a residual-based measurement to determine sample commonness. Using this measurement, UniAM achieves domain-wise and category-wise Common Feature Alignment (CFA) and Target Class Separation (TCS). Notably, UniAM is the first method to directly use the attention in the vision transformer to perform classification tasks. Extensive experiments show that UniAM outperforms current state-of-the-art methods on various benchmark datasets.