Vision Transformers (ViTs) have emerged as powerful models in computer vision, delivering superior performance across a variety of vision tasks. However, their high computational complexity poses a significant barrier to practical deployment in real-world scenarios. Because not all tokens contribute equally to the final prediction, and fewer tokens mean lower computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating ViTs. We argue that it is suboptimal to reduce only inattentive redundancy through token pruning, or only duplicative redundancy through token merging. We therefore propose a novel acceleration framework, token Pruning & Pooling Transformers (PPT), which adaptively tackles these two types of redundancy in different layers. By heuristically integrating token pruning and token pooling into ViTs without additional trainable parameters, PPT effectively reduces model complexity while maintaining predictive accuracy. For example, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset. The code is available at https://github.com/xjwu1024/PPT and https://github.com/mindspore-lab/models/
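To make the two kinds of token reduction concrete, the following is a minimal NumPy sketch, not the released PPT implementation. Pruning drops low-scoring (inattentive) tokens, while pooling averages highly similar (duplicative) tokens. The function names, the greedy pairwise merging, and the variance-based switch with `var_threshold` are all illustrative assumptions; the abstract does not specify how the per-layer choice is made.

```python
import numpy as np


def prune_tokens(tokens, scores, keep):
    """Token pruning: keep the `keep` highest-scoring tokens, in original order."""
    top = np.argsort(-scores, kind="stable")[:keep]
    return tokens[np.sort(top)]


def pool_tokens(tokens, keep):
    """Token pooling: greedily average the most cosine-similar pair until `keep` remain.

    Greedy pairwise merging is an assumption made for this sketch.
    """
    tokens = np.asarray(tokens, dtype=float).copy()
    sizes = np.ones(len(tokens))  # how many original tokens each row represents
    while len(tokens) > keep:
        norms = np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12
        normed = tokens / norms
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        # Size-weighted mean so earlier merges are not under-counted.
        tokens[i] = (sizes[i] * tokens[i] + sizes[j] * tokens[j]) / (sizes[i] + sizes[j])
        sizes[i] += sizes[j]
        tokens = np.delete(tokens, j, axis=0)
        sizes = np.delete(sizes, j)
    return tokens


def reduce_tokens(tokens, scores, keep, var_threshold=0.01):
    """Hypothetical per-layer switch (assumption): prune when scores are spread out,
    pool when scores are nearly uniform and no token is clearly unimportant."""
    if np.var(scores) > var_threshold:
        return prune_tokens(tokens, scores, keep), "prune"
    return pool_tokens(tokens, keep), "pool"
```

In this sketch, `scores` stands in for any per-token importance measure (for instance, attention received from the class token); neither step introduces trainable parameters, which matches the parameter-free property stated above.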