Most domain adaptation (DA) methods are based on either convolutional neural networks (CNNs) or vision transformers (ViTs). These methods use the networks as encoders to align the distribution differences between domains, without considering each architecture's unique characteristics. For instance, ViT excels in accuracy due to its superior ability to capture global representations, while CNN has an advantage in capturing local representations. This fact has led us to design a hybrid method, called Explicitly Class-specific Boundaries (ECB), that takes full advantage of both ViT and CNN. ECB learns CNN on ViT to combine their distinct strengths. In particular, we leverage ViT's properties to explicitly find class-specific decision boundaries by maximizing the discrepancy between the outputs of the two classifiers, which detects target samples far from the source support. In contrast, the CNN encoder clusters target features based on the previously defined class-specific boundaries by minimizing the discrepancy between the probabilities of the two classifiers. Finally, ViT and CNN mutually exchange knowledge to improve the quality of pseudo labels and reduce the knowledge discrepancies between the two models. Compared to conventional DA methods, ECB achieves superior performance, which verifies the effectiveness of this hybrid model. The project website can be found at https://dotrannhattuong.github.io/ECB/website.
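To make the quantity that is maximized and then minimized concrete, the following is a minimal sketch of a two-classifier discrepancy. It is an illustration under stated assumptions, not the ECB implementation: the abstract does not say how the discrepancy is measured, so the mean absolute difference between softmax probabilities (a common choice in classifier-discrepancy methods) is assumed here, and the function names and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(logits_a, logits_b):
    """Mean absolute difference between two classifiers' class probabilities.

    ASSUMPTION: this L1-style measure is illustrative; the abstract does not
    specify the discrepancy used by ECB.
    Each argument is a batch: a list of per-sample logit lists.
    """
    per_sample = []
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        per_sample.append(sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa))
    return sum(per_sample) / len(per_sample)


# Hypothetical logits for one target sample over 3 classes.
agree = classifier_discrepancy([[4.0, 0.0, 0.0]], [[4.0, 0.0, 0.0]])
disagree = classifier_discrepancy([[4.0, 0.0, 0.0]], [[0.0, 4.0, 0.0]])
print(agree, disagree)
```

Target samples on which the two classifiers disagree give a large value, and samples on which they agree give zero. In a training loop, one step would increase this value on target data and the other would decrease it; exactly which parameters each step updates is not spelled out in the abstract.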