Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging due to high computational demands. To accelerate pre-trained ViTs, token pruning and token merging approaches have been developed, both of which aim to reduce the number of tokens involved in the computation. However, these methods still have limitations, such as the loss of image information carried by pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method that addresses the challenge of balancing model efficiency and information preservation for efficient ViTs. Inspired by graph summarization algorithms, GTP propagates the information of less significant tokens to spatially and semantically connected tokens of greater importance. Consequently, the few remaining tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information from the eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify the image tokens to be propagated. Extensive experiments validate GTP's effectiveness, demonstrating improvements in both efficiency and performance. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a 0.3% accuracy drop on ImageNet-1K without finetuning, and surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at https://github.com/Ackesnal/GTP-ViT.
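The propagation idea described above can be illustrated with a minimal sketch. It is not the implementation from the GTP repository: the function name `propagate_tokens`, the externally supplied importance scores, the 8-neighbourhood spatial edges, the clipped cosine similarity used as the semantic edge weight, and the mixing factor `alpha` are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def propagate_tokens(tokens, scores, grid_w, num_drop, alpha=0.5):
    """Drop the `num_drop` least important tokens and spread their features
    over the kept tokens they are connected to.

    tokens : list of N feature vectors, row-major on a grid of width grid_w
    scores : one importance value per token (assumed to be given)
    alpha  : hypothetical mixing factor for the propagated features
    Returns (kept_indices, summarized_tokens).
    """
    n, dim = len(tokens), len(tokens[0])
    order = sorted(range(n), key=lambda i: scores[i])
    dropped = set(order[:num_drop])
    kept = [i for i in range(n) if i not in dropped]
    out = {i: list(tokens[i]) for i in kept}

    for j in sorted(dropped):
        rj, cj = divmod(j, grid_w)
        weights = {}
        for i in kept:
            ri, ci = divmod(i, grid_w)
            # Spatial edge: tokens that touch on the patch grid (assumption).
            spatial = 1.0 if max(abs(ri - rj), abs(ci - cj)) <= 1 else 0.0
            # Semantic edge: non-negative cosine similarity (assumption).
            semantic = max(cosine(tokens[i], tokens[j]), 0.0)
            if spatial + semantic > 0:
                weights[i] = spatial + semantic
        total = sum(weights.values())
        if total == 0:
            continue  # isolated token: nothing to propagate to
        for i, w in weights.items():
            for d in range(dim):
                out[i][d] += alpha * (w / total) * tokens[j][d]

    return kept, [out[i] for i in kept]


# Toy 2x2 patch grid with 2-dimensional features.
tokens = [[1, 0], [0, 1], [1, 1], [2, 0]]
scores = [0.9, 0.1, 0.8, 0.7]
kept, summary = propagate_tokens(tokens, scores, grid_w=2, num_drop=1)
```

In this toy example the token at index 1 is removed, and its features are shared among the three remaining tokens in proportion to their edge weights, so the token that is both adjacent and most similar receives the largest share. Normalizing the weights per dropped token keeps the total propagated mass equal to `alpha` times the dropped feature, which is one simple design choice among several possible ones.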