GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging due to high computational demands. To accelerate pre-trained ViTs, token pruning and token merging approaches have been developed, both of which reduce the number of tokens involved in the computation. However, these methods still have limitations, such as loss of image information from pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method that addresses the challenge of balancing model efficiency and information preservation in efficient ViTs. Inspired by graph summarization algorithms, GTP propagates the information of less significant tokens to more important tokens that are spatially and semantically connected to them. Consequently, the few remaining tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information from the eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify the image tokens to be propagated. Extensive experiments validate GTP's effectiveness, demonstrating improvements in both efficiency and performance. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at https://github.com/Ackesnal/GTP-ViT.
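
To make the propagation idea concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes that per-token importance scores are already available (the abstract does not specify the selection strategy), that the token graph joins 8-neighbour spatial edges with top-k cosine-similarity edges, and that each kept token receives the mean of its eliminated neighbours scaled by a factor `alpha`. The names `build_graph`, `propagate`, `k_semantic`, and `alpha` are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_graph(tokens, grid_h, grid_w, k_semantic=2):
    """Binary token graph with spatial and semantic edges (illustrative assumption)."""
    n = tokens.shape[0]
    assert n == grid_h * grid_w, "tokens must fill the patch grid"
    adj = np.zeros((n, n), dtype=float)

    # Spatial edges: tokens whose patches touch on the grid (8-neighbourhood).
    rows, cols = np.divmod(np.arange(n), grid_w)
    near = (np.abs(rows[:, None] - rows[None, :]) <= 1) & (
        np.abs(cols[:, None] - cols[None, :]) <= 1
    )
    adj[near] = 1.0

    # Semantic edges: each token links to its k most cosine-similar tokens.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is never its own neighbour
    top = np.argsort(-sim, axis=1)[:, :k_semantic]
    adj[np.arange(n)[:, None], top] = 1.0

    adj = np.maximum(adj, adj.T)  # make the graph undirected
    np.fill_diagonal(adj, 0.0)    # drop self-loops
    return adj


def propagate(tokens, scores, adj, num_propagated, alpha=0.5):
    """Fold the lowest-scoring tokens into their kept graph neighbours."""
    order = np.argsort(-scores)  # most important first
    n_keep = tokens.shape[0] - num_propagated
    kept = np.sort(order[:n_keep])
    dropped = np.sort(order[n_keep:])

    # Edges from each kept token to the eliminated tokens it is connected to.
    weights = adj[np.ix_(kept, dropped)]
    # Assumed normalisation: average over the eliminated neighbours of each kept token.
    degree = weights.sum(axis=1, keepdims=True)
    weights = weights / np.maximum(degree, 1.0)

    summary = tokens[kept] + alpha * weights @ tokens[dropped]
    return summary, kept


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 4x4 patch grid, 8-dimensional embeddings
scores = rng.random(16)            # stand-in importance scores
adj = build_graph(tokens, 4, 4)
summary, kept = propagate(tokens, scores, adj, num_propagated=4)
print(summary.shape)  # (12, 8)
```

In this sketch, eliminating 4 of 16 tokens leaves 12 tokens for subsequent layers, and setting `alpha=0` reduces the procedure to plain pruning, which illustrates the difference between discarding tokens and summarizing them.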
