I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference

Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline allows quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic relies on the homogeneity condition that holds in convolutional neural networks, which does not apply to the non-linear components in ViTs, leaving integer-only inference of ViTs an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs that enables the entire computational graph of inference to be performed with integer arithmetic and bit-shifting, without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models, and the results show that integer-only INT8 quantization achieves accuracy comparable to (or even slightly higher than) that of the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving 3.72$\sim$4.11$\times$ inference speedup compared to the FP model. Code for both PyTorch and TVM is released at https://github.com/zkkli/I-ViT.
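
As an illustration of the dyadic arithmetic used for linear operations, the following minimal sketch shows how a floating-point rescaling factor can be replaced by an integer multiply and a bit-shift. It is not taken from the released I-ViT code: the function names `dyadic_approx`, `requantize`, and `clamp_int8`, the 31-bit multiplier width, and the round-half-up rule are assumptions chosen for the example.

```python
import math


def dyadic_approx(scale: float, bits: int = 31) -> tuple[int, int]:
    """Approximate a positive scale as b / 2**c with integers b and c.

    Hypothetical helper: the 31-bit multiplier width is an assumption.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    # frexp gives scale = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(scale)
    b = round(mantissa * (1 << bits))
    c = bits - exponent
    if c < 1:
        raise ValueError("scale too large for this sketch")
    return b, c


def requantize(acc: int, b: int, c: int) -> int:
    """Compute round(acc * b / 2**c) with integer multiply, add, and shift only."""
    # Adding 2**(c-1) before the arithmetic right shift rounds half up.
    return (acc * b + (1 << (c - 1))) >> c


def clamp_int8(x: int) -> int:
    """Saturate to the signed INT8 range."""
    return max(-128, min(127, x))


# Offline (calibration time): convert the real-valued scale once.
b, c = dyadic_approx(0.0123)

# Online (inference time): rescale an INT32 accumulator without floats.
out = clamp_int8(requantize(1000, b, c))
```

The conversion to `(b, c)` happens once offline, so floating-point arithmetic never appears on the inference path; only the multiply and shift in `requantize` do. This works for linear layers because scaling commutes with them, which is the homogeneity condition that the non-linear operations do not satisfy.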
