Vision Transformer (ViT) architectures are increasingly popular and widely used in computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism, which lets them outperform earlier convolutional neural networks. However, the deployment cost and performance of ViTs have both grown steadily with their size, number of trainable parameters, and number of operations. Furthermore, the computational and memory cost of self-attention increases quadratically with image resolution. In general, these architectures are challenging to employ in real-world applications because of hardware and environmental restrictions, such as limited processing and computational capabilities. Therefore, this survey investigates the most efficient methodologies for preserving near-optimal estimation performance. In more detail, four categories of efficiency strategies are analyzed: compact architectures, pruning, knowledge distillation, and quantization. Moreover, a new metric called Efficient Error Rate is introduced to normalize and compare the model features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first defines mathematically the strategies used to make Vision Transformers efficient, then describes and discusses state-of-the-art methodologies, and finally analyzes their performance across different application scenarios. Toward the end of the paper, we also discuss open challenges and promising research directions.