Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for various image analysis tasks, offering comparable or superior performance. However, a significant drawback of ViTs is their resource-intensive nature, which leads to an increased memory footprint, computational complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning method that addresses the resource demands of ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning exploits the block-wise structure of linear layers, enabling more efficient matrix multiplications. To optimize this pruning scheme, we propose a novel hardware-aware learning objective, tailored to the block sparsity structure, that simultaneously maximizes speedup and minimizes power consumption during inference. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, we provide a lightweight algorithm for post-training pruning of ViTs, using a second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet across various ViT architectures, including DeiT-B and DeiT-S, demonstrate performance competitive with other pruning methods and a remarkable balance between accuracy preservation and power savings. In particular, we achieve up to 3.93x and 1.79x speedups on dedicated hardware and GPUs, respectively, for DeiT-B, and observe a 1.4x reduction in inference power on real-world GPUs.