Image classification is a fundamental computer vision task that frequently serves as a benchmark for progress in the field. Over the past few years, deep learning has driven significant progress in image classification. However, challenges remain, such as modeling fine-grained visual information, high computational cost, model parallelism, and inconsistent evaluation protocols across datasets. In this paper, we present a comprehensive survey of existing work on Vision Transformers for image classification. We first introduce the popular image classification datasets that have influenced model design. We then present Vision Transformer models in chronological order, starting with early attempts to adapt attention mechanisms to vision tasks and continuing to the adoption of Vision Transformers, which have demonstrated success in capturing intricate patterns and long-range dependencies within images. Finally, we discuss open problems and highlight opportunities in image classification to facilitate new research ideas.