Model binarization can significantly compress model size, reduce energy consumption, and accelerate inference through efficient bit-wise operations. Although the binarization of convolutional neural networks has been extensively studied, little work has explored the binarization of vision Transformers, which underpin most recent breakthroughs in visual recognition. To this end, we address two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT). First, traditional binarization methods do not take the long-tailed distribution of softmax attention into account, which introduces large binarization errors in the attention module. To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization. Second, to better preserve the information of the pretrained model and restore accuracy, we propose a Cross-layer Binarization scheme that decouples the binarization of self-attention and multi-layer perceptrons (MLPs), and Parameterized Weight Scales, which introduce learnable scaling factors for weight binarization. Overall, our method outperforms state-of-the-art methods by 19.8% on the TinyImageNet dataset. On ImageNet, our BiViT achieves a competitive 75.6% Top-1 accuracy with a Swin-S model. Additionally, on COCO object detection, our method achieves an mAP of 40.8 with a Swin-T backbone in the Cascade Mask R-CNN framework.
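To make the first challenge concrete, the following is a minimal sketch, not the BiViT implementation. It assumes the common sign-based scheme in which a tensor is approximated by a single scale times its signs, with the scale set to the mean absolute value. The function names, the example logits, and the fixed threshold of 0.5 are all hypothetical choices made for illustration; in particular, the fixed-threshold variant is only a stand-in that shows why a distribution-aware rule can help, and it is not the Softmax-aware Binarization proposed here.

```python
import math


def binarize_weights(w):
    """Sign binarization with one scale: w is approximated by alpha * sign(w).

    alpha = mean(|w|) minimizes the squared error for fixed sign codes.
    """
    alpha = sum(abs(x) for x in w) / len(w)
    codes = [1 if x >= 0 else -1 for x in w]
    return alpha, codes


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binarize_attention(p, threshold):
    """Illustrative {0, 1} binarization of attention probabilities.

    Hypothetical stand-in: entries at or above a fixed threshold are kept
    and share one scale (their mean); all other entries become zero.
    """
    codes = [1 if x >= threshold else 0 for x in p]
    kept = [x for x, c in zip(p, codes) if c]
    scale = sum(kept) / len(kept) if kept else 0.0
    return scale, codes


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# One attention row with a long tail: a single dominant logit.
logits = [4.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
attn = softmax(logits)

# Naive sign binarization: every probability is non-negative, so every
# code is +1 and the row collapses to a uniform value.
alpha, sign_codes = binarize_weights(attn)
sign_recon = [alpha * c for c in sign_codes]

# Threshold-based {0, 1} binarization keeps the dominant entry apart.
scale, thr_codes = binarize_attention(attn, threshold=0.5)
thr_recon = [scale * c for c in thr_codes]

print("sign codes:     ", sign_codes, "error:", mse(attn, sign_recon))
print("threshold codes:", thr_codes, "error:", mse(attn, thr_recon))
```

Because softmax outputs are non-negative and sum to one, sign binarization maps the whole row to the same value and discards which token dominates, whereas even a crude threshold preserves that information. A method that adapts to the data distribution, as described above, would replace the fixed threshold with a rule derived from the attention values themselves.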