EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in multi-head self-attention (MHSA). Therefore, we design a new building block with a sandwich layout, i.e., a single memory-bound MHSA layer placed between efficient feed-forward network (FFN) layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that attention maps are highly similar across heads, leading to computational redundancy. To address this, we present a cascaded group attention module that feeds each attention head a different split of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate that EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while achieving 40.4% and 45.2% higher throughput on an Nvidia V100 GPU and an Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.
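
To make the cascaded group attention idea concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a single unbatched sequence of tokens, random projection weights, and illustrative sizes (16 tokens, 64 channels, 4 heads, key dimension 8). The function name `cascaded_group_attention` and all parameter names are hypothetical. The cascade step, in which each head's output is added to the next head's input split, is one plausible reading of "cascaded" and is labeled as an assumption in the code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, wq, wk, wv, wo):
    """Illustrative sketch (hypothetical helper, not the official code).

    x  : (tokens, channels) input feature
    wq : list of per-head query projections, each (channels // heads, key_dim)
    wk : list of per-head key projections,   each (channels // heads, key_dim)
    wv : list of per-head value projections, each (channels // heads, channels // heads)
    wo : (channels, channels) output projection
    """
    num_heads = len(wq)
    # Each head receives a different channel split of the full feature.
    splits = np.split(x, num_heads, axis=-1)
    outputs = []
    prev = None
    for i, feat in enumerate(splits):
        if prev is not None:
            # Assumption: the cascade adds the previous head's output
            # to the current head's input split.
            feat = feat + prev
        q, k, v = feat @ wq[i], feat @ wk[i], feat @ wv[i]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        prev = attn @ v
        outputs.append(prev)
    # Concatenate head outputs and project back to the full channel width.
    return np.concatenate(outputs, axis=-1) @ wo

# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
tokens, channels, heads, key_dim = 16, 64, 4, 8
d = channels // heads
x = rng.standard_normal((tokens, channels))
wq = [rng.standard_normal((d, key_dim)) * 0.1 for _ in range(heads)]
wk = [rng.standard_normal((d, key_dim)) * 0.1 for _ in range(heads)]
wv = [rng.standard_normal((d, d)) * 0.1 for _ in range(heads)]
wo = rng.standard_normal((channels, channels)) * 0.1

y = cascaded_group_attention(x, wq, wk, wv, wo)
print(y.shape)  # (16, 64)
```

Because each head projects only `channels // heads` input channels rather than the full feature, the query, key, and value projections are smaller than in standard MHSA, which is where the computation saving in this sketch comes from.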
