Vision Transformers (ViTs) have emerged as powerful models in computer vision, delivering superior performance across a variety of vision tasks. Their high computational complexity, however, poses a significant barrier to practical deployment in real-world scenarios. Because not all tokens contribute equally to the final prediction, and fewer tokens mean lower computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. We argue that it is suboptimal either to reduce only inattentive redundancy through token pruning or to reduce only duplicative redundancy through token merging. To this end, we propose a novel acceleration framework, token Pruning & Pooling Transformers (PPT), that adaptively tackles these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling techniques into ViTs without additional trainable parameters, PPT effectively reduces model complexity while maintaining predictive accuracy. For example, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset. The code is available at https://github.com/xjwu1024/PPT and https://github.com/mindspore-lab/models/
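To make the two kinds of redundancy concrete, the following is a minimal, illustrative sketch in plain Python and is not the PPT implementation. It assumes each token is a small non-zero feature vector, that a per-token importance score (for example, the attention a token receives from the class token) is already available, and that cosine similarity is a reasonable measure of duplication. The function names `prune_tokens` and `pool_tokens`, the scores, and the toy data are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prune_tokens(tokens, scores, keep):
    """Token pruning (inattentive redundancy): keep the `keep` tokens with
    the highest importance scores, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


def pool_tokens(tokens, n_merge):
    """Token pooling (duplicative redundancy): n_merge times, find the most
    similar pair of tokens and replace the pair with its average."""
    tokens = [list(t) for t in tokens]  # work on a copy
    for _ in range(n_merge):
        if len(tokens) < 2:
            break
        pairs = [(i, j) for i in range(len(tokens))
                 for j in range(i + 1, len(tokens))]
        i, j = max(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]  # j > i, so index i is unaffected
    return tokens


# Toy example: four 2-D tokens with assumed importance scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
scores = [0.4, 0.3, 0.05, 0.25]

pruned = prune_tokens(tokens, scores, keep=3)  # drops the low-score token
pooled = pool_tokens(tokens, n_merge=1)        # averages the two near-duplicates
print(len(pruned), len(pooled))
```

In this toy input, pruning discards the token that receives little attention even though it is unlike the others, while pooling averages the two nearly identical tokens and keeps the distinct one. The two operations therefore remove different tokens, which is the intuition behind combining them; how PPT chooses between them in each layer is not specified in this abstract and is not modeled here.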