Fully test-time adaptation adapts a network model during inference, based on sequential analysis of the input samples, to address the cross-domain performance degradation of deep neural networks. This work builds on the following finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned during test-time adaptation to capture the domain-specific characteristics of target samples. When combined with the input image patch embeddings, this learned token gradually removes domain-specific information from the feature representations of input samples during transformer encoding, thereby significantly improving the test-time adaptation performance of the source model across different domains. We refer to this class token as the visual conditioning token (VCT). To learn the VCT, we propose a bi-level learning approach that captures the long-term variations of domain-specific characteristics while accommodating the local variations of instance-specific characteristics. Experimental results on benchmark datasets demonstrate that the proposed bi-level visual conditioning token learning method improves test-time adaptation performance by up to 1.9%.
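The abstract does not spell out the update rules, so the following is only a minimal sketch of one way a two-timescale ("bi-level") token could be organized, not the authors' implementation. It assumes a slow long-term component updated by an exponential moving average, a fast per-batch offset that is reset for every batch, and a gradient supplied by the caller. The names `BiLevelToken`, `prepend_token`, `local_step`, and the `momentum` parameter are all hypothetical.

```python
def prepend_token(token, patch_embeddings):
    """Build the first-layer input: conditioning token first, then the patches."""
    return [list(token)] + [list(p) for p in patch_embeddings]


def local_step(offset, grad, lr):
    """One plain gradient-descent step on the fast, per-batch offset.

    `grad` is assumed to come from whatever test-time objective is in use;
    computing it is outside the scope of this sketch.
    """
    return [o - lr * g for o, g in zip(offset, grad)]


class BiLevelToken:
    """Hypothetical two-timescale token (an assumption, not the paper's method).

    long_term: slowly varying part, meant to track domain-level statistics.
    offset:    fast part, re-initialized per batch for instance-level variation.
    """

    def __init__(self, dim, momentum=0.9):
        self.long_term = [0.0] * dim
        self.momentum = momentum

    def start_batch(self):
        # The fast offset starts from zero for every new batch.
        return [0.0] * len(self.long_term)

    def token(self, offset):
        # Token actually fed to the encoder: slow part plus fast offset.
        return [lt + o for lt, o in zip(self.long_term, offset)]

    def end_batch(self, offset):
        # Fold the batch-adapted token into the slow part via an
        # exponential moving average (assumed update rule).
        m = self.momentum
        adapted = self.token(offset)
        self.long_term = [
            m * lt + (1.0 - m) * a for lt, a in zip(self.long_term, adapted)
        ]
```

In use, each incoming batch would call `start_batch()`, take one or more `local_step` updates, feed `prepend_token(vct.token(offset), patches)` to the encoder, and finish with `end_batch(offset)`, so that only a fraction of each batch's adjustment persists across batches.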