Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, these architectural modifications reduce the generality of the Transformer backbone, partially contradicting the push towards uniform architectures shared, e.g., by Computer Vision and Natural Language Processing. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Through extensive experiments, we show that the proposed regularization leads to results equivalent to or better than those of other VT proposals which include a local bias by changing the basic Transformer architecture, and that it can drastically boost the final VT accuracy when using small to medium-sized training sets. The code is available at https://github.com/helia95/SAR.
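To make the notion of a spatial entropy concrete, the following is a minimal sketch, not the implementation from the repository above. It assumes a single 2D attention map given as a nested list of non-negative values, binarizes it with a hypothetical `threshold`, groups the surviving cells into 8-connected regions, and computes the Shannon entropy of the attention mass distributed over those regions. The function names, the threshold, and the choice of 8-connectivity are illustrative assumptions. This plain-Python version is not differentiable and only illustrates the quantity being measured.

```python
import math
from collections import deque


def connected_components(mask):
    """Return the 8-connected regions of True cells as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or seen[i][j]:
                continue
            # Breadth-first flood fill starting from an unvisited active cell.
            queue, cells = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(cells)
    return components


def spatial_entropy(attn, threshold=0.5):
    """Entropy of the attention mass spread over connected regions.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    mask = [[v > threshold for v in row] for row in attn]
    masses = [sum(attn[y][x] for y, x in cells)
              for cells in connected_components(mask)]
    total = sum(masses)
    if total == 0:
        return 0.0  # no active cell: define the entropy as zero
    return -sum((m / total) * math.log(m / total) for m in masses if m > 0)


# One compact blob versus four isolated activations of equal total mass.
clustered = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
scattered = [[1, 0, 1, 0], [0, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
print(spatial_entropy(clustered), spatial_entropy(scattered))
```

Under these assumptions, a map whose activations form a single connected region has zero entropy, while the same mass split evenly over four isolated cells has entropy log 4. Minimizing such a quantity therefore favors attention concentrated in a few connected regions.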