Vision Transformers (ViTs) have been reported to succeed on a wide range of image recognition tasks. The advantage of ViT over CNNs has been largely attributed to large training datasets or auxiliary pre-training. Without pre-training, the performance of ViT on small datasets is limited because global self-attention has limited capacity for local modeling. To boost ViT on small datasets without pre-training, this work improves its local modeling by applying a weight mask to the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix is an element-wise learnable weight mask (ELM), for which our preliminary experiments show promising results. However, the ELM not only introduces a non-trivial parameter overhead but also increases optimization complexity. To this end, this work proposes a novel Gaussian mixture mask (GMM), in which each mask has only two learnable parameters and which can be conveniently used in any ViT variant whose attention mechanism allows the use of masks. Experimental results on multiple small datasets demonstrate the effectiveness of the proposed Gaussian mask for boosting ViTs for free (almost zero additional parameter or computation cost). Our code will be publicly available at \href{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}.
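To make the idea of a Gaussian mask on the self-attention matrix concrete, the following is a minimal sketch and not the released implementation. It rests on several assumptions that the abstract does not state: each Gaussian component is parameterized by an amplitude and a width (standing in for the two learnable parameters per mask), the mask depends only on the distance between patch positions on the 2D patch grid, and the mask is added to the pre-softmax attention scores. The names `gaussian_mixture_mask` and `masked_softmax` are hypothetical, and the parameters are plain floats rather than trained values.

```python
import math


def gaussian_mixture_mask(grid_size, amplitudes, sigmas):
    """Build an N x N mask over N = grid_size**2 patches.

    Entry [i][j] is sum_k a_k * exp(-d_ij^2 / (2 * sigma_k^2)), where d_ij is
    the Euclidean distance between patches i and j on the patch grid.
    Assumption: (a_k, sigma_k) play the role of the two learnable parameters
    of the k-th mask; here they are fixed floats for illustration.
    """
    coords = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    mask = []
    for ri, ci in coords:
        row = []
        for rj, cj in coords:
            d2 = (ri - rj) ** 2 + (ci - cj) ** 2
            row.append(sum(a * math.exp(-d2 / (2.0 * s * s))
                           for a, s in zip(amplitudes, sigmas)))
        mask.append(row)
    return mask


def masked_softmax(scores, mask):
    """Add the mask to raw attention scores, then apply a row-wise softmax.

    Assumption: the mask is additive; a multiplicative mask is equally
    compatible with the abstract's description.
    """
    out = []
    for s_row, m_row in zip(scores, mask):
        logits = [s + m for s, m in zip(s_row, m_row)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Usage: a 3 x 3 patch grid with uniform (all-zero) raw attention scores.
grid = 3
mask = gaussian_mixture_mask(grid, amplitudes=[1.0, 0.5], sigmas=[1.0, 2.0])
scores = [[0.0] * grid ** 2 for _ in range(grid ** 2)]
attn = masked_softmax(scores, mask)
```

With uniform raw scores, the resulting attention row for a patch places more weight on nearby patches than on distant ones, which is the local bias the mask is meant to supply. Under these assumptions the mask adds two scalars per Gaussian component, compared with one weight per matrix entry for an element-wise mask.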