Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging from two perspectives: 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss in accuracy while being two orders of magnitude faster than existing methods.
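To make the "delayed" part of the idea concrete, the following is a minimal sketch, not the DSM implementation. It assumes a merging step that greedily averages the most similar pair of tokens (an illustrative stand-in for whatever matching rule the method actually uses) and a hypothetical `delay` parameter that leaves the first few blocks untouched. The hierarchical processing scheme is not modeled here, and all function and parameter names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_most_similar(tokens, r):
    """Greedily average the most similar pair of tokens, up to r times.

    Illustrative stand-in for a merging step (assumption), chosen for
    readability rather than speed.
    """
    tokens = [list(t) for t in tokens]
    for _ in range(min(r, len(tokens) - 1)):
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(tokens[i], tokens[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # Replace token i with the average of the pair and drop token j.
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]
    return tokens


def run_blocks(tokens, blocks, r, delay):
    """Apply blocks in order, merging r tokens after each block with index >= delay.

    `delay` (hypothetical name) is the number of bottom blocks left unmerged.
    Returns the final tokens and the token count after every block.
    """
    counts = []
    for idx, block in enumerate(blocks):
        tokens = block(tokens)
        if idx >= delay:
            tokens = merge_most_similar(tokens, r)
        counts.append(len(tokens))
    return tokens, counts
```

With four identity blocks, eight input tokens, `r=2`, and `delay=2`, the token count stays at 8 through the first two blocks and then drops by two per block. In a real ViT the blocks would be attention and MLP layers, and the merging rule and schedule would follow the paper rather than this sketch.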