Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are insufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed $\textbf{ShiftAddViT}$, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all $\texttt{MatMuls}$ among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization of attention preserves model accuracy, but inevitably causes accuracy drops when applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework that reparameterizes MLPs by taking multiplication and its primitives, e.g., bitwise shift, as experts, together with a new latency-aware load-balancing loss. This loss helps to train a generic router that assigns a dynamic number of input tokens to different experts according to their latency. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to $\textbf{5.18$\times$}$ latency reductions on GPUs and $\textbf{42.9}$% energy savings, while maintaining accuracy comparable to that of the original or other efficient ViTs.
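To make the shift-reparameterization of linear layers concrete, here is a minimal NumPy sketch of the underlying idea: each weight is rounded to a signed power of two, so that every multiply could in principle be realized as a sign flip plus a bitwise shift of the exponent. This is an illustrative simplification under our own assumptions (the function name, the `eps` guard, and the random toy data are ours), not the paper's actual TVM kernel, which operates on quantized integer representations.

```python
import numpy as np

def shift_reparameterize(w, eps=1e-8):
    """Project each weight onto sign(w) * 2^k with integer k.

    Multiplying by such a weight amounts to a bitwise shift (by k)
    plus a sign flip, which is the source of the hardware savings.
    `eps` only guards the log against exact zeros.
    """
    sign = np.sign(w)
    exponent = np.round(np.log2(np.abs(w) + eps))  # nearest power of two
    return sign * (2.0 ** exponent)

# Toy linear layer: compare the dense output with the shift-only output.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))       # dense weights
x = rng.normal(size=(4,))         # one input token
wq = shift_reparameterize(w)      # shift-friendly weights
y_dense = x @ w
y_shift = x @ wq
```

Because rounding happens in log2 space, each reparameterized weight is within a factor of $\sqrt{2}$ of the original, which is why fine-tuning (rather than training from scratch) suffices to recover accuracy.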