The transformer has extended its success from the language domain to the vision domain. Because vision transformers are built from stacked self-attention and cross-attention blocks, accelerating their deployment on GPU hardware is challenging and has rarely been studied. This paper designs a compression scheme, GPUSQ-ViT, that makes maximal use of GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, an original large model with dense weights is first pruned into a sparse model by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type. The floating-point sparse model is then quantized into a fixed-point model by sparse-distillation-aware quantization-aware training, which exploits the extra speedup that the GPU provides for 2:4 sparse computation on integer tensors. Mixed-strategy knowledge distillation is used during both pruning and quantization. The proposed compression scheme supports both supervised and unsupervised learning. Experimental results show that GPUSQ-ViT achieves state-of-the-art compression, reducing vision transformer models by 6.4-12.7 times in model size and 30.3-62 times in FLOPs with negligible accuracy degradation on the ImageNet classification, COCO detection, and ADE20K segmentation benchmarks. Moreover, GPUSQ-ViT improves actual deployment latency and throughput by 1.39-1.79 times and 3.22-3.43 times on the A100 GPU, and by 1.57-1.69 times and 2.11-2.51 times on AGX Orin.
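To make the 2:4 pattern concrete, the following is a minimal NumPy sketch of the two compression steps applied to a single weight tensor. It is an illustration under stated assumptions, not the GPUSQ-ViT implementation: the magnitude-based choice of which two weights to keep, the symmetric per-tensor INT8 scheme, and the helper names `prune_2_4` and `quantize_int8` are all hypothetical, and the sketch omits the knowledge distillation and the training that the scheme relies on.

```python
import numpy as np


def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every group of 4 (last axis).

    Assumption: magnitude is used as the pruning criterion for illustration.
    """
    if weights.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = weights.reshape(-1, 4)
    # Indices of the two smallest |w| in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(weights.shape)


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization (assumed scheme).

    Returns the int8 tensor and the scale needed to dequantize it.
    """
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


# Toy weight row with two groups of four values.
w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]], dtype=np.float32)
sparse_w = prune_2_4(w)             # at most 2 nonzeros per group of 4
q, scale = quantize_int8(sparse_w)  # zeros stay zero, so the 2:4 pattern survives
```

Because a pruned weight quantizes to exactly zero, the integer tensor keeps the same 2:4 pattern as the FP16 one, which is what lets the sparse and quantized speedups combine.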