Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. This large number of tokens may not be necessary, however, as not all tokens are equally important. In this paper, we investigate token pruning to accelerate inference for object detection and instance segmentation, extending prior work on image classification. Through extensive experiments, we offer four insights for dense tasks: (i) tokens should not be completely pruned and discarded, but rather preserved in the feature maps for later use; (ii) reactivating previously pruned tokens can further enhance model performance; (iii) a dynamic pruning rate that adapts to each image is better than a fixed pruning rate; and (iv) a lightweight 2-layer MLP can effectively prune tokens, achieving accuracy comparable to that of complex gating networks with a simpler design. We evaluate the impact of these design choices on the COCO dataset and present a method integrating these insights that outperforms prior token pruning models, reducing the performance drop from ~1.5 mAP to ~0.3 mAP for both boxes and masks. Compared to the dense counterpart that uses all tokens, our method achieves up to 34% faster inference for the whole network and 46% for the backbone.
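To make the four insights concrete, the following is a minimal pure-Python sketch, not the authors' implementation. All function names (`mlp_keep_score`, `select_active`, `sparse_block`, `run_stages`), the sigmoid output, and the 0.5 threshold are assumptions introduced for illustration. A real model would operate on batched tensors and gather or scatter the active tokens.

```python
import math


def mlp_keep_score(token, w1, b1, w2, b2):
    """Hypothetical 2-layer MLP gate: linear -> ReLU -> linear -> sigmoid.

    Returns a keep score in (0, 1) for one token (a list of floats).
    """
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, token)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def select_active(tokens, params, threshold=0.5):
    """Dynamic pruning: keep every token whose score passes the threshold.

    The number of kept tokens varies with the image content, unlike a
    fixed top-k rule. The threshold value is an assumption.
    """
    return [mlp_keep_score(t, *params) >= threshold for t in tokens]


def sparse_block(tokens, active, update):
    """Apply `update` to active tokens only.

    Inactive tokens are copied through unchanged, so they stay in the
    feature map instead of being discarded.
    """
    return [update(t) if a else list(t) for t, a in zip(tokens, active)]


def run_stages(tokens, stage_params, update):
    """Re-score all tokens at every stage.

    Because each stage scores the full token set, a token skipped in an
    earlier stage can be reactivated in a later one.
    """
    for params in stage_params:
        active = select_active(tokens, params)
        tokens = sparse_block(tokens, active, update)
    return tokens
```

In this sketch the output always contains as many tokens as the input, which is what lets a dense prediction head consume a full feature map. The saving comes from skipping `update` (standing in for a transformer block) on inactive tokens.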