Vision transformers (ViTs) have achieved promising results on a variety of Computer Vision tasks, however their quadratic complexity in the number of input tokens has limited their application specially in resource-constrained settings. Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers. We empirically demonstrate that this assumption is often not correct, i.e., tokens that are redundant in one layer can be useful in later layers. We employ this key insight to propose a novel token propagation controller (TPC) that incorporates two different token-distributions, i.e., pause probability and restart probability to control the reduction and reuse of tokens respectively, which results in more efficient token utilization. To improve the estimates of token distributions, we propose a smoothing mechanism that acts as a regularizer and helps remove noisy outliers. Furthermore, to improve the training-stability of our proposed TPC, we introduce a model stabilizer that is able to implicitly encode local image structures and minimize accuracy fluctuations during model training. We present extensive experimental results on the ImageNet-1K dataset using DeiT, LV-ViT and Swin models to demonstrate the effectiveness of our proposed method. For example, compared to baseline models, our proposed method improves the inference speed of the DeiT-S by 250% while increasing the classification accuracy by 1.0%.
翻译:视觉Transformer(ViT)在各种计算机视觉任务中取得了显著成果,但其在输入令牌数量上的二次复杂度限制了其在资源受限场景中的应用。以往采用逐步令牌缩减策略解决该问题的方法假设:某一层中的令牌冗余意味着后续所有层中的令牌冗余。我们通过实验证明,这一假设往往不成立,即某一层冗余的令牌可能在后续层中发挥作用。基于这一关键发现,我们提出了一种新型令牌传播控制器(TPC),该控制器结合两种不同的令牌分布——暂停概率与重启概率——分别控制令牌的缩减与重用,从而实现了更高效的令牌利用。为提高令牌分布的估计精度,我们提出一种充当正则化器的平滑机制,有助于去除噪声异常值。此外,为增强TPC的训练稳定性,我们引入一种模型稳定器,该稳定器能够隐式编码局部图像结构,并在模型训练过程中最小化精度波动。我们利用DeiT、LV-ViT和Swin模型在ImageNet-1K数据集上进行了大量实验,以证明所提方法的有效性。例如,与基准模型相比,所提方法使DeiT-S的推理速度提升250%,同时分类准确率提高1.0%。