Transformers are popular neural network models that apply layers of self-attention and fully connected nodes to embedded tokens. Vision Transformers (ViT) adapt transformers to image recognition by splitting each image into patches and using the patches as tokens. One issue with ViT is its lack of inductive bias toward image structure. Because ViT was adapted to image data from language modeling, the network does not explicitly handle issues such as local translations, pixel information, and the loss of information in structures and features shared by multiple patches. Convolutional Neural Networks (CNN), in contrast, incorporate this information. Thus, in this paper, we propose the use of convolutional layers within ViT. Specifically, we propose a model called the Vision Conformer (ViC), which replaces the Multi-Layer Perceptron (MLP) in a ViT layer with a CNN. To make the CNN applicable, we also propose reconstructing the image data after the self-attention in a reverse embedding layer. Through evaluation, we demonstrate that the proposed convolutions help improve the classification ability of ViT.
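To make the data flow described above concrete, the following is a minimal NumPy sketch of a convolutional sub-layer that could stand in for the MLP of a ViT layer. It is an illustration under stated assumptions, not the architecture evaluated in the paper: the function names (`patchify`, `unpatchify`, `conv3x3_same`, `conv_sublayer`), the weight names (`w_rev`, `kernel`, `w_emb`), the single 3x3 convolution with ReLU, the linear re-embedding, and the residual connection are all hypothetical choices. The self-attention step is omitted; `tokens` is assumed to be its output.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened patches in row-major grid order."""
    h, w, c = img.shape
    gh, gw = h // p, w // p
    x = img.reshape(gh, p, gw, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, p * p * c)

def unpatchify(patches, p, h, w, c):
    """Inverse of patchify: reassemble (N, p*p*C) patches into an (H, W, C) image."""
    gh, gw = h // p, w // p
    x = patches.reshape(gh, gw, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def conv3x3_same(img, kernel):
    """Zero-padded 3x3 cross-correlation; kernel has shape (3, 3, C_in, C_out)."""
    h, w, _ = img.shape
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the image times the per-offset channel-mixing matrix.
            out += padded[dy:dy + h, dx:dx + w, :] @ kernel[dy, dx]
    return out

def conv_sublayer(tokens, w_rev, kernel, w_emb, p, h, w, c):
    """Hypothetical replacement for the ViT MLP: reverse-embed, convolve, re-embed."""
    patches = tokens @ w_rev                  # reverse embedding: (N, D) -> (N, p*p*C)
    img = unpatchify(patches, p, h, w, c)     # restore the 2D layout across patches
    img = np.maximum(conv3x3_same(img, kernel), 0.0)  # convolution + ReLU (assumed)
    return tokens + patchify(img, p) @ w_emb  # back to tokens, with an assumed residual

# Example: an 8x8 RGB image, 4x4 patches, 16-dimensional tokens.
rng = np.random.default_rng(0)
h, w, c, p, d = 8, 8, 3, 4, 16
tokens = rng.normal(size=((h // p) * (w // p), d))   # stand-in for self-attention output
w_rev = rng.normal(size=(d, p * p * c)) * 0.1
kernel = rng.normal(size=(3, 3, c, c)) * 0.1
w_emb = rng.normal(size=(p * p * c, d)) * 0.1
out = conv_sublayer(tokens, w_rev, kernel, w_emb, p, h, w, c)
print(out.shape)
```

Because the convolution runs on the reassembled image rather than on each token separately, its 3x3 window crosses patch borders, which is one way to expose structures shared by neighboring patches to the layer.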