Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. This large number of tokens may not be necessary, however, as not all tokens are equally important. In this paper, we investigate token pruning to accelerate inference for object detection and instance segmentation, extending prior work on image classification. Through extensive experiments, we offer four insights for dense tasks: (i) tokens should not be completely pruned and discarded, but rather preserved in the feature maps for later use; (ii) reactivating previously pruned tokens can further enhance model performance; (iii) a dynamic pruning rate that adapts to each image is better than a fixed pruning rate; (iv) a lightweight 2-layer MLP can effectively prune tokens, achieving accuracy comparable to that of more complex gating networks with a simpler design. We evaluate the impact of these design choices on the COCO dataset and present a method integrating these insights that outperforms prior token pruning models, reducing the performance drop from ~1.5 mAP to ~0.3 mAP for both boxes and masks. Compared to the dense counterpart that uses all tokens, our method achieves up to 34% faster inference for the whole network and 46% for the backbone.
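The four insights can be illustrated with a small NumPy sketch. This is not the paper's implementation. The function names (`mlp_keep_scores`, `pruned_block`), the ReLU and sigmoid activations, the 0.5 threshold, and the `block_fn` stand-in for a transformer block are all assumptions made for illustration. In particular, a real ViT block mixes information across tokens through attention, whereas `block_fn` here simply receives the selected tokens.

```python
import numpy as np


def mlp_keep_scores(tokens, w1, b1, w2, b2):
    """Score each token with a 2-layer MLP (hypothetical gating module).

    tokens: (N, D) array. Returns an (N,) array of keep-probabilities.
    """
    hidden = np.maximum(tokens @ w1 + b1, 0.0)   # ReLU, an assumed activation
    logits = hidden @ w2 + b2                    # shape (N, 1)
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))   # sigmoid


def pruned_block(tokens, active, block_fn, scores, threshold=0.5, reactivate=True):
    """Run block_fn on the selected tokens only, keeping the rest in place.

    tokens: (N, D) feature map; active: (N,) bool mask from the previous layer.
    Returns the updated feature map and the new selection mask.
    """
    # Thresholding gives a dynamic rate: the number of kept tokens depends
    # on the image rather than on a fixed top-k budget.
    selected = scores > threshold
    if not reactivate:
        selected &= active        # without reactivation, pruned tokens stay pruned
    out = tokens.copy()           # pruned tokens are preserved, not discarded
    if selected.any():
        out[selected] = block_fn(tokens[selected])
    return out, selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))
    w1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
    w2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
    scores = mlp_keep_scores(tokens, w1, b1, w2, b2)
    active = np.ones(16, dtype=bool)
    out, selected = pruned_block(tokens, active, lambda x: 2.0 * x, scores)
    print(f"kept {int(selected.sum())} of {len(selected)} tokens")
```

Because the output keeps the full `(N, D)` shape, a detection or segmentation head that expects a dense feature map can still consume it, and a later layer is free to select a token that an earlier layer skipped.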