Vision Transformers (ViTs) have emerged as powerful models in computer vision, delivering superior performance across a variety of vision tasks. However, their high computational complexity is a significant barrier to practical deployment in real-world scenarios. Because not all tokens contribute equally to the final prediction, and fewer tokens mean lower computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating ViTs. We argue that it is suboptimal to reduce only inattentive redundancy by token pruning, or only duplicative redundancy by token merging. To this end, we propose a novel acceleration framework, token Pruning & Pooling Transformers (PPT), which adaptively tackles these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling into ViTs without additional trainable parameters, PPT effectively reduces model complexity while maintaining predictive accuracy. For example, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
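To make the two kinds of redundancy concrete, the following is a minimal NumPy sketch, not the PPT implementation. Pruning drops the tokens with the lowest importance scores (inattentive redundancy), while pooling averages the most similar token pairs (duplicative redundancy). The function names, the use of cosine similarity, the unweighted averaging, and the variance-based switch in `reduce_tokens` (including its `var_threshold` value) are all assumptions chosen for illustration; the abstract only states that the choice is made adaptively and heuristically per layer.

```python
import numpy as np

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    top = np.argsort(-scores, kind="stable")[:keep]
    return tokens[np.sort(top)]

def pool_tokens(tokens, r):
    """Greedily merge the r most cosine-similar token pairs by plain averaging."""
    tokens = tokens.astype(float).copy()
    for _ in range(r):
        if len(tokens) < 2:
            break
        unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
        sim = unit @ unit.T
        np.fill_diagonal(sim, -np.inf)  # a token must not be paired with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        i, j = min(i, j), max(i, j)
        tokens[i] = (tokens[i] + tokens[j]) / 2.0  # unweighted mean (assumption)
        tokens = np.delete(tokens, j, axis=0)
    return tokens

def reduce_tokens(tokens, scores, n_remove, var_threshold=0.01):
    """Hypothetical per-layer policy: prune when scores are spread out, else pool.

    The variance test and its threshold are illustrative assumptions,
    not details taken from the abstract.
    """
    if np.var(scores) > var_threshold:
        return prune_tokens(tokens, scores, len(tokens) - n_remove)
    return pool_tokens(tokens, n_remove)
```

Both branches remove the same number of tokens, so downstream cost is identical either way; what differs is whether the information in the removed tokens is discarded (pruning) or folded into a neighbour (pooling). In a real ViT the scores would typically come from attention weights, and the class token would be excluded from reduction.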