Vision Transformers (ViTs) are becoming an increasingly popular and dominant technique for a wide range of vision tasks compared to Convolutional Neural Networks (CNNs). As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships. In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, describing them in terms of strengths and weaknesses, computational cost, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we explore some limitations with insightful observations and provide directions for further research. The project page, along with the collection of papers, is available at https://github.com/khawar512/ViT-Survey