Recent work has applied the vision transformer (ViT) to the challenging task of unsupervised domain adaptation (UDA), typically using the cross-attention in ViT for direct domain alignment. However, because the performance of cross-attention relies heavily on the quality of the pseudo labels for target samples, it becomes less effective as the domain gap grows. We address this problem from a game-theoretic perspective with a model dubbed PMTrans, which bridges the source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that builds the intermediate domain, i.e., a probability distribution, by learning to sample patches from both domains based on game-theoretical models. PatchMix learns to mix patches from the source and target domains so as to maximize the cross entropy (CE), while two semi-supervised mixup losses in the feature and label spaces are used to minimize it. We thus interpret UDA as a min-max CE game with three players, namely the feature extractor, the classifier, and PatchMix, in which we seek the Nash equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods, by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet.
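To make the patch-mixing idea concrete, the following is a minimal sketch and not the PMTrans implementation. It rests on several assumptions that the abstract does not state: each patch is mixed by a convex combination, the per-patch ratio is drawn from a Beta distribution with fixed parameters (the abstract says the sampling is learned), and the attention scores are supplied as plain numbers. The function names `patch_mix`, `attention_weighted_ratio`, and `mixed_label` are hypothetical.

```python
import random


def patch_mix(source_patches, target_patches, alpha=1.0, beta=1.0, rng=None):
    """Mix two images patch by patch (illustrative assumption, not PMTrans code).

    Each patch is a flat list of floats. Returns the mixed patches and the
    per-patch ratios, i.e. the share of each patch taken from the source.
    """
    if len(source_patches) != len(target_patches):
        raise ValueError("both images need the same number of patches")
    rng = rng or random.Random()
    mixed, ratios = [], []
    for src, tgt in zip(source_patches, target_patches):
        # Assumed sampling rule: one Beta-distributed ratio per patch.
        lam = rng.betavariate(alpha, beta)
        ratios.append(lam)
        mixed.append([lam * s + (1.0 - lam) * t for s, t in zip(src, tgt)])
    return mixed, ratios


def attention_weighted_ratio(ratios, attention):
    """Collapse per-patch ratios into a single source-label weight.

    Patches with higher attention count more; with uniform attention this
    reduces to the plain mean of the ratios.
    """
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention scores must sum to a positive value")
    return sum(a * r for a, r in zip(attention, ratios)) / total


def mixed_label(source_label, target_pseudo_label, source_weight):
    """Soft label of the mixed sample: convex combination of two class distributions."""
    return [source_weight * s + (1.0 - source_weight) * t
            for s, t in zip(source_label, target_pseudo_label)]


# Usage with three two-pixel patches and two classes (toy values).
rng = random.Random(0)
src = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
mixed, ratios = patch_mix(src, tgt, rng=rng)
weight = attention_weighted_ratio(ratios, attention=[0.6, 0.3, 0.1])
label = mixed_label([1.0, 0.0], [0.2, 0.8], weight)
```

In this sketch the label weight follows the attention-weighted share of source content rather than a simple patch count, which is one plausible reading of re-weighting each patch's label by its importance. In the min-max game described above, the mixing parameters would be trained to increase the cross entropy while the feature extractor and classifier are trained to decrease it; that training loop is omitted here.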