Vision transformers (ViTs) have emerged as a new paradigm in computer vision, delivering excellent performance at a high computational cost. Image token pruning is one of the main approaches to ViT compression, because computational complexity is quadratic in the number of tokens and many tokens cover only background regions that contribute little to the final prediction. Existing methods either rely on additional modules to score the importance of individual tokens or apply a fixed pruning ratio to every input instance. In this work, we propose an adaptive sparse token pruning framework with minimal cost. Specifically, we first propose an inexpensive class attention scoring mechanism weighted by attention head importance. Learnable parameters are then inserted as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores against these thresholds, we discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized with budget-aware training to balance accuracy and complexity, yielding a pruning configuration adapted to each input instance. Extensive experiments demonstrate the effectiveness of our approach. Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy, achieving a better trade-off between accuracy and latency than previous methods.
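The scoring and thresholding steps described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the score of a token is taken to be the head-importance-weighted sum of the class token's attention to that token, the head importance values are supplied by the caller (how they are obtained is not stated), and the threshold is a fixed number rather than a parameter learned with budget-aware training. The names `weighted_class_attention` and `prune_tokens` are hypothetical.

```python
from typing import List, Sequence


def weighted_class_attention(
    cls_attn: Sequence[Sequence[float]],
    head_importance: Sequence[float],
) -> List[float]:
    """Combine per-head class-token attention into one score per image token.

    cls_attn[h][j] is the attention the class token pays to image token j
    in head h. head_importance[h] is a non-negative weight for head h
    (assumed to be given; the abstract does not say how it is computed).
    """
    total = sum(head_importance)
    if total <= 0:
        raise ValueError("head_importance must have a positive sum")
    weights = [w / total for w in head_importance]  # normalize to sum to 1
    num_tokens = len(cls_attn[0])
    return [
        sum(weights[h] * cls_attn[h][j] for h in range(len(cls_attn)))
        for j in range(num_tokens)
    ]


def prune_tokens(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of tokens whose score reaches the threshold."""
    return [j for j, s in enumerate(scores) if s >= threshold]


# Two toy inputs: 2 heads, 4 image tokens each; head 0 weighted 3x head 1.
head_importance = [3.0, 1.0]
threshold = 0.2  # stand-in for a learned threshold (assumption)

spread_out = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.1, 0.4, 0.4]]
concentrated = [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]]

kept_a = prune_tokens(weighted_class_attention(spread_out, head_importance), threshold)
kept_b = prune_tokens(weighted_class_attention(concentrated, head_importance), threshold)
print(kept_a, kept_b)
```

With the same threshold, the first input keeps three tokens and the second keeps one, which is the sense in which threshold-based pruning adapts to each input instead of enforcing a fixed ratio. In a hierarchical setting, this comparison would be repeated at several layers, each with its own threshold.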