Vision transformers have recently become popular as a possible alternative to convolutional neural networks (CNNs) for a variety of computer vision applications. Because they can model global relationships in images, vision transformers have large capacity, but they may generalize poorly compared to CNNs. More recently, hybridizing convolution and self-attention mechanisms in vision transformers has gained popularity, since the combination exploits both local and global image representations. These CNN-Transformer architectures, also known as hybrid vision transformers, have shown remarkable results in vision applications. Given the rapidly growing number of hybrid vision transformers, there is a need for a taxonomy and explanation of these architectures. This survey presents a taxonomy of recent vision transformer architectures, and more specifically of hybrid vision transformers. It also discusses the key features of each architecture, such as attention mechanisms, positional embeddings, multi-scale processing, and convolution. The survey highlights the potential of hybrid vision transformers to achieve outstanding performance on a variety of computer vision tasks and points toward future directions for this rapidly evolving field.