Universal domain adaptation (UniDA) aims to transfer knowledge from a source domain to a target domain without any prior knowledge about the label set. The central challenge is determining whether a target sample belongs to a common category, that is, one shared by both domains. Mainstream methods make this judgment from sample features, which overemphasizes global information while ignoring the most crucial local objects in the image and thus limits accuracy. To address this issue, we propose the Universal Attention Matching (UniAM) framework, which exploits the self-attention mechanism in the vision transformer to capture crucial object information. The framework introduces a novel Compressive Attention Matching (CAM) approach that explores the core information by compressively representing attentions. CAM further incorporates a residual-based measurement to determine sample commonness. Using this measurement, UniAM achieves domain-wise and category-wise Common Feature Alignment (CFA) and Target Class Separation (TCS). Notably, UniAM is the first method to use the attention in the vision transformer directly to perform classification tasks. Extensive experiments show that UniAM outperforms current state-of-the-art methods on various benchmark datasets.
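The idea of a residual-based commonness measurement can be illustrated with a small sketch. The abstract does not specify how attentions are compressed or how the residual is computed, so everything below is an assumption made for illustration: the function name `commonness_residual`, the use of per-class source attention prototypes, and ridge-regularized least squares as a stand-in for the compressive representation. A target attention vector is reconstructed from the source prototypes; a small relative residual suggests the sample resembles a common class, while a large one suggests a target-private class.

```python
import numpy as np


def commonness_residual(target_attn, source_prototypes, ridge=1e-3):
    """Relative residual of reconstructing a target attention vector
    from source class prototypes (hypothetical helper, not the paper's code).

    target_attn:       shape (d,), attention over d image patches
    source_prototypes: shape (k, d), one attention prototype per source class
    """
    A = np.asarray(source_prototypes, dtype=float).T  # (d, k) dictionary
    y = np.asarray(target_attn, dtype=float)          # (d,)
    k = A.shape[1]
    # Ridge-regularized least squares: (A^T A + ridge * I) w = A^T y
    w = np.linalg.solve(A.T @ A + ridge * np.eye(k), A.T @ y)
    # Reconstruction error, normalized by the size of the input
    residual = np.linalg.norm(y - A @ w) / (np.linalg.norm(y) + 1e-12)
    return residual, w


# Toy data (made up): two source prototypes over four patches.
prototypes = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
])
common_like = np.array([0.65, 0.15, 0.10, 0.10])   # close to prototype 0
private_like = np.array([0.05, 0.05, 0.10, 0.80])  # attends to an unseen patch

r_common, _ = commonness_residual(common_like, prototypes)
r_private, _ = commonness_residual(private_like, prototypes)
print(f"common-like residual:  {r_common:.3f}")
print(f"private-like residual: {r_private:.3f}")
```

In this toy setting the sample whose attention resembles a source prototype reconstructs with a much smaller residual than the one attending to a patch no prototype covers. Any threshold separating the two would be a further design choice that the abstract does not describe.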