Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, these architectural modifications reduce the generality of the Transformer backbone, which partially contradicts the push towards uniform architectures shared, e.g., by both Computer Vision and Natural Language Processing. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Through extensive experiments, we show that the proposed regularization leads to results equivalent to or better than those of other VT proposals which include a local bias by changing the basic Transformer architecture, and that it can drastically boost the final VT accuracy when using small to medium-sized training sets. The code is available at https://github.com/helia95/SAR.
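To make the idea of a spatial entropy concrete, the following is a minimal sketch of one possible formulation; it is not taken from the released code. It assumes a single 2D attention map, binarizes it at its mean value, groups the surviving locations into 4-connected regions, and computes the Shannon entropy of the attention mass distributed over those regions. The function names, the mean threshold, and the 4-connectivity are all illustrative assumptions. This NumPy version only measures the quantity; a training regularizer would need a differentiable counterpart, which is not shown here.

```python
from collections import deque

import numpy as np


def connected_components(mask):
    """Label the 4-connected regions of a boolean 2D mask (hypothetical helper)."""
    height, width = mask.shape
    labels = np.zeros((height, width), dtype=int)
    count = 0
    for i in range(height):
        for j in range(width):
            if not mask[i, j] or labels[i, j] != 0:
                continue
            # Start a new region and flood-fill it with a breadth-first search.
            count += 1
            labels[i, j] = count
            queue = deque([(i, j)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny, nx] and labels[ny, nx] == 0:
                        labels[ny, nx] = count
                        queue.append((ny, nx))
    return labels, count


def spatial_entropy(attention, threshold=None):
    """Entropy of the attention mass spread over connected regions.

    Assumption: locations above `threshold` (default: the map's mean) form
    the regions, and each region's mass is its total excess over the threshold.
    """
    attention = np.asarray(attention, dtype=float)
    if threshold is None:
        threshold = attention.mean()
    mask = attention > threshold
    labels, count = connected_components(mask)
    if count == 0:
        return 0.0  # a flat map has no regions, so there is nothing to spread
    excess = attention - threshold
    mass = np.array([excess[labels == k].sum() for k in range(1, count + 1)])
    probabilities = mass / mass.sum()
    entropy = -(probabilities * np.log(probabilities)).sum()
    return float(abs(entropy))  # abs() only normalizes -0.0 to 0.0


# One compact blob versus two separate blobs of equal mass.
one_blob = np.zeros((6, 6))
one_blob[1:3, 1:3] = 1.0

two_blobs = np.zeros((6, 6))
two_blobs[0:2, 0:2] = 1.0
two_blobs[4:6, 4:6] = 1.0

print(spatial_entropy(one_blob), spatial_entropy(two_blobs))
```

Under these assumptions, a map whose attention concentrates in a single connected region has zero spatial entropy, while splitting the same mass equally across two regions raises it to log 2. Minimizing such a quantity therefore favors attention maps made of a few connected regions, which is the object-based bias described above.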