Vision transformers (ViTs) have achieved promising results on a variety of Computer Vision tasks, however their quadratic complexity in the number of input tokens has limited their application specially in resource-constrained settings. Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers. We empirically demonstrate that this assumption is often not correct, i.e., tokens that are redundant in one layer can be useful in later layers. We employ this key insight to propose a novel token propagation controller (TPC) that incorporates two different token-distributions, i.e., pause probability and restart probability to control the reduction and reuse of tokens respectively, which results in more efficient token utilization. To improve the estimates of token distributions, we propose a smoothing mechanism that acts as a regularizer and helps remove noisy outliers. Furthermore, to improve the training-stability of our proposed TPC, we introduce a model stabilizer that is able to implicitly encode local image structures and minimize accuracy fluctuations during model training. We present extensive experimental results on the ImageNet-1K dataset using DeiT, LV-ViT and Swin models to demonstrate the effectiveness of our proposed method. For example, compared to baseline models, our proposed method improves the inference speed of the DeiT-S by 250% while increasing the classification accuracy by 1.0%.
翻译:视觉Transformer(ViTs)已在多种计算机视觉任务中取得显著成果,然而其输入令牌数量的二次复杂度限制了其在资源受限场景中的应用。为应对这一挑战,现有方法采用逐步令牌缩减策略,但默认某层中的令牌冗余会在后续所有层中延续。我们通过实验证明该假设通常不成立——即某一层冗余的令牌可能在后续层中发挥重要作用。基于这一关键发现,我们提出了一种新型令牌传播控制器(TPC),该控制器融合两种不同的令牌分布机制:暂停概率和重启概率,分别控制令牌缩减与复用,从而实现更高效的令牌利用。为提升令牌分布估计的准确性,我们提出了一种充当正则化器的平滑机制,能够有效去除噪声异常值。此外,为增强TPC的训练稳定性,我们引入了一个模型稳定器,该稳定器可隐式编码局部图像结构,并在模型训练过程中最小化精度波动。我们在ImageNet-1K数据集上使用DeiT、LV-ViT和Swin模型进行了大量实验,验证了所提出方法的有效性。例如,与基线模型相比,本方法使DeiT-S的推理速度提升250%,同时分类准确率提升1.0%。