Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision tasks, but their considerable memory and computation costs hinder their application on resource-limited devices. As one of the most powerful compression methods, binarization reduces the computation of a neural network by quantizing its weights and activation values to $\pm 1$. Although existing binarization methods have demonstrated excellent performance on Convolutional Neural Networks (CNNs), the full binarization of ViTs remains under-studied and suffers from a significant performance drop. In this paper, we first argue empirically that the severe performance degradation is mainly caused by weight oscillation during binarization training and by information distortion in the activations of ViTs. Based on these analyses, we propose $\textbf{BinaryViT}$, an accurate full binarization scheme for ViTs, which pushes the quantization of ViTs to the limit. Specifically, we propose a novel gradient regularization scheme (GRS) that drives the weights toward a bimodal distribution to reduce oscillation in binarization training. Moreover, we design an activation shift module (ASM) that adaptively tunes the activation distribution to reduce the information distortion caused by binarization. Extensive experiments on the ImageNet dataset show that BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method reduces model size and OPs by 16.2$\times$ and 17.7$\times$, respectively, compared to the full-precision DeiT-S.
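To illustrate why quantizing weights and activations to $\pm 1$ reduces computation, the following is a minimal sketch of generic sign binarization and of a dot product evaluated with XNOR and a bit count. It is not the GRS or ASM proposed in this paper, and the function names (`binarize`, `pack_bits`, `binary_dot`) are hypothetical, chosen only for this illustration.

```python
import random


def binarize(values):
    """Map each real value to +1 or -1 by its sign (0 is sent to +1 by convention)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a list of +1/-1 values into an integer: bit i is 1 where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def binary_dot(a_signs, w_signs):
    """Dot product of two +1/-1 vectors using XNOR and a popcount.

    Positions where the two bits agree contribute +1 and the rest contribute -1,
    so dot = 2 * (number of agreeing bits) - n.
    """
    n = len(a_signs)
    mask = (1 << n) - 1  # keep only the n valid bits after the bitwise NOT
    xnor = ~(pack_bits(a_signs) ^ pack_bits(w_signs)) & mask
    matches = bin(xnor).count("1")
    return 2 * matches - n


# Compare against the ordinary multiply-accumulate on random inputs.
rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(64)]
weights = [rng.gauss(0.0, 1.0) for _ in range(64)]
a_bin, w_bin = binarize(activations), binarize(weights)
reference = sum(a * w for a, w in zip(a_bin, w_bin))
assert binary_dot(a_bin, w_bin) == reference
```

In this sketch each binarized value occupies a single bit and the multiply-accumulate collapses into one bitwise operation plus a count, which is the general source of the model-size and OPs savings that binarization targets.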