The transformer, first applied in natural language processing, is a type of deep neural network based mainly on the self-attention mechanism. Because of its strong representation capabilities, researchers are exploring ways to apply the transformer to computer vision tasks. On a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks. Given its high performance and reduced need for vision-specific inductive bias, the transformer is receiving increasing attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them according to task and analyzing their advantages and disadvantages. The main categories we explore are backbone networks, high/mid-level vision, low-level vision, and video processing. We also cover efficient transformer methods for deploying transformers in real device-based applications. Furthermore, we take a brief look at the self-attention mechanism in computer vision, as it is the base component of the transformer. Toward the end of this paper, we discuss the challenges and suggest several further research directions for vision transformers.