Making Vision Transformers Efficient from A Token Sparsification View

The computational cost of Vision Transformers (ViTs) grows quadratically with the number of tokens, which limits their practical application. Several works prune redundant tokens to make ViTs more efficient. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) difficulty of application to local vision transformers, and (iii) networks that are not general-purpose enough for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. The semantic tokens represent cluster centers; they are initialized by spatially pooling image tokens and recovered by attention, so they can adaptively represent global or local semantic information. Owing to these clustering properties, a few semantic tokens can attain the same effect as a large number of image tokens, for both global and local vision transformers. For instance, with only 16 semantic tokens, DeiT-(Tiny, Small, Base) achieves the same accuracy with more than 100% faster inference and nearly 60% fewer FLOPs; on Swin-(Tiny, Small, Base), employing 16 semantic tokens in each window yields a further speedup of around 20% with a slight accuracy increase. Beyond image classification, we also extend our method to video recognition. In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it work for downstream tasks, which previous token sparsification methods cannot handle. Experiments demonstrate that our method achieves competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
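
The idea of initializing semantic tokens by spatial pooling and then refining them with attention can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 14 x 14 grid of 8-dimensional image tokens with toy values, uses a single attention step with no learned projections, heads, or residual connections, and the function names `adaptive_avg_pool` and `attend` are hypothetical.

```python
import math

def adaptive_avg_pool(grid, out_size):
    """Average-pool an H x W grid of D-dim tokens into out_size * out_size tokens."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(out_size):
        # Bin boundaries: floor for the start, ceil for the end.
        r0, r1 = (i * h) // out_size, -((-(i + 1) * h) // out_size)
        for j in range(out_size):
            c0, c1 = (j * w) // out_size, -((-(j + 1) * w) // out_size)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append([sum(t[k] for t in cells) / len(cells) for k in range(d)])
    return pooled

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same tokens (assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(wt * v[k] for wt, v in zip(weights, keys_values)) for k in range(d)])
    return out

# Toy input: a 14 x 14 grid of 8-dim image tokens (values are arbitrary).
H = W = 14
D = 8
grid = [[[math.sin(0.3 * r + 0.7 * c + k) for k in range(D)] for c in range(W)] for r in range(H)]
image_tokens = [tok for row in grid for tok in row]

# Step 1: initialize 16 semantic tokens (4 x 4) by pooling image tokens in space.
semantic = adaptive_avg_pool(grid, 4)

# Step 2: refine them by attending over all image tokens (semantic tokens as queries).
recovered = attend(semantic, image_tokens)

print(len(image_tokens), len(semantic), len(recovered))  # 196 16 16
```

In this sketch, any layer that operates only on the 16 refined tokens computes attention over 16 x 16 token pairs rather than 196 x 196, which is the intuition behind the efficiency gain. The overall FLOPs savings reported above depend on the full architecture and are not reproduced here.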
