Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy

As Vision Transformers (ViTs) set new benchmarks in computer vision, their deployment on inference engines is often hindered by their large memory bandwidth and (on-chip) memory footprint requirements. This paper addresses this memory limitation with an activation-aware model compression methodology that applies low-rank weight tensor approximations selectively to different layers to reduce the parameter count of ViTs. The key idea is to decompose each weight tensor into a sum of two parameter-efficient tensors while minimizing the error between the product of the input activations with the original weight tensor and their product with the approximate tensor sum. This approximation is further refined by an efficient layer-wise error compensation technique that uses the gradient of the layer's output loss. Combining these techniques avoids being trapped in a shallow local minimum early in the optimization process and strikes a good balance between model compression and output accuracy. Notably, the presented method reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset, overcoming the accuracy degradation usually seen in low-rank approximations. In addition, the presented technique can compress large DeiT/ViT models to about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain. These results highlight the efficacy of our approach as a viable solution for embedding ViTs in memory-constrained environments without compromising their performance.
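
To make the activation-aware objective concrete, the following is a minimal NumPy sketch, not the method presented in this paper. It assumes a single low-rank term rather than the sum of two parameter-efficient tensors, and it omits the gradient-based error compensation. The function names, the calibration matrix `X`, and the regularizer `eps` are hypothetical. It swaps in a standard closed-form approach: whiten the weight with a Cholesky factor of the activation Gram matrix, truncate the SVD, and undo the whitening, which minimizes the output error `||X W - X A B||` instead of the weight error `||W - A B||`.

```python
import numpy as np


def plain_low_rank(W, rank):
    """Rank-`rank` factors of W from a truncated SVD (ignores activations)."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * sigma[:rank], Vt[:rank]


def activation_aware_low_rank(W, X, rank, eps=1e-6):
    """Rank-`rank` factors (A, B) chosen so that X @ A @ B is close to X @ W.

    W: (d_in, d_out) weight matrix.
    X: (n, d_in) calibration activations (hypothetical input of this sketch).
    """
    d_in = W.shape[0]
    # Gram matrix of the activations, lightly regularized so Cholesky succeeds.
    gram = X.T @ X + eps * np.eye(d_in)
    # gram = S.T @ S, hence ||X @ M||_F is (up to eps) equal to ||S @ M||_F.
    S = np.linalg.cholesky(gram).T
    # Best rank-`rank` approximation of the whitened weight S @ W.
    U, sigma, Vt = np.linalg.svd(S @ W, full_matrices=False)
    # Undo the whitening on the left factor: A = S^{-1} (U_r * sigma_r).
    A = np.linalg.solve(S, U[:, :rank] * sigma[:rank])
    B = Vt[:rank]
    return A, B


# Small synthetic comparison; the input features have very different scales,
# which is the case where weighting by activations matters.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32)) * np.linspace(0.1, 5.0, 32)
W = rng.standard_normal((32, 16))

A, B = activation_aware_low_rank(W, X, rank=4)
P, Q = plain_low_rank(W, rank=4)

print("output error, activation-aware:", np.linalg.norm(X @ W - X @ A @ B))
print("output error, plain SVD:       ", np.linalg.norm(X @ W - X @ P @ Q))
print("parameters:", W.size, "->", A.size + B.size)
```

In this sketch the factors store `rank * (d_in + d_out)` parameters instead of `d_in * d_out`, so the parameter reduction is controlled by the rank chosen for each layer.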
