BinaryViT: Towards Efficient and Accurate Binary Vision Transformers

Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields, but their considerable memory and computation costs hinder their application on resource-limited devices. As one of the most powerful compression methods, binarization reduces the computation of a neural network by quantizing its weights and activation values to $\pm 1$. Although existing binarization methods have demonstrated excellent performance on Convolutional Neural Networks (CNNs), the full binarization of ViTs remains under-studied and suffers from a significant performance drop. In this paper, we first argue empirically that the severe performance degradation is mainly caused by weight oscillation during binarization training and by information distortion in the activations of ViTs. Based on these analyses, we propose $\textbf{BinaryViT}$, an accurate full binarization scheme for ViTs, which pushes the quantization of ViTs to the limit. Specifically, we propose a novel gradient regularization scheme (GRS) that drives the weights toward a bimodal distribution to reduce oscillation in binarization training. Moreover, we design an activation shift module (ASM) that adaptively tunes the activation distribution to reduce the information distortion caused by binarization. Extensive experiments on the ImageNet dataset show that BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves savings of 16.2$\times$ in model size and 17.7$\times$ in OPs compared to the full-precision DeiT-S.
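
To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It shows why $\pm 1$ quantization reduces computation (a dot product becomes an XNOR followed by a popcount) and why shifting activations before the sign function can preserve information. The helper names (`binarize`, `pack_bits`, `binary_dot`) and the fixed `shift` value are assumptions made for illustration; the ASM described above tunes the shift adaptively, which this sketch does not attempt to reproduce.

```python
def binarize(values, shift=0.0):
    """Map each real value to +1 or -1 after subtracting a shift."""
    return [1 if v - shift >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +/-1 vector into an integer, one bit per element (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_signs, b_signs):
    """Dot product of two +/-1 vectors computed with XNOR and popcount."""
    n = len(a_signs)
    mask = (1 << n) - 1
    # XNOR marks the positions where both vectors carry the same sign.
    xnor = ~(pack_bits(a_signs) ^ pack_bits(b_signs)) & mask
    matches = bin(xnor).count("1")
    # Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - n


# Illustrative values only (not taken from the paper).
weights = [0.7, -0.2, 0.05, -1.3]
activations = [0.9, 0.4, 0.1, 0.6]  # all non-negative

w_b = binarize(weights)
a_plain = binarize(activations)               # every entry collapses to +1
a_shifted = binarize(activations, shift=0.5)  # hypothetical fixed shift
```

Without a shift, every non-negative activation binarizes to `+1`, so the binary vector carries no information about the input; with the shift, entries above and below 0.5 become distinguishable. The packed XNOR and popcount path returns the same result as multiplying and summing the $\pm 1$ values element by element.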
