Filtering out irrelevant tokens before self-attention is a straightforward way to make vision Transformers more efficient. This is the first work to view token filtering from a feature selection perspective: we weigh the importance of a token by how much the loss changes once that token is masked. If masking a token changes the loss greatly, the token has a significant impact on the final decision and is therefore relevant. Otherwise, the token matters less for the final decision and can be filtered out. Once the token filtering module has been learned from the whole training set, it substantially reduces the number of tokens fed to the self-attention module at inference, which leads to far less computation in all subsequent self-attention layers. The token filter can be realized with a very simple network; we use a multi-layer perceptron. Our method differs from other token filters in two ways: it performs token filtering only once, at the very beginning and before any self-attention, and it predicts token impact from a feature selection point of view. Experiments show that the proposed method offers an efficient route to a lightweight model: the filter is optimized together with a backbone by fine-tuning, which makes it easier to deploy than existing methods that require training from scratch.
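The masking criterion described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy classifier (mean pooling followed by a linear layer), masking by zeroing a token, the absolute loss difference as the score, and the names `toy_loss`, `token_importance`, and `filter_tokens`. The multi-layer perceptron that learns to predict these scores is not shown.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_loss(tokens, weights, label):
    """Cross-entropy of a toy classifier: mean-pool the tokens, then a linear layer."""
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    logits = [sum(w[d] * pooled[d] for d in range(dim)) for w in weights]
    return -math.log(softmax(logits)[label])

def token_importance(tokens, weights, label):
    """Score each token by |loss(masked) - loss(original)|.

    Masking is assumed here to mean replacing the token with zeros.
    """
    base = toy_loss(tokens, weights, label)
    scores = []
    for i in range(len(tokens)):
        masked = [[0.0] * len(t) if j == i else t for j, t in enumerate(tokens)]
        scores.append(abs(toy_loss(masked, weights, label) - base))
    return scores

def filter_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

# Hypothetical input: four 2-dimensional tokens and a 2-class linear head.
tokens = [[2.0, 0.0], [0.0, 0.0], [0.2, 0.0], [0.0, 2.0]]
weights = [[1.0, -1.0], [-1.0, 1.0]]
scores = token_importance(tokens, weights, label=0)
kept = filter_tokens(tokens, scores, keep=2)
```

In this toy setting the all-zero token leaves the loss unchanged when masked, so it receives the lowest score and is dropped first, while the two tokens that move the logits most are kept. In the method described above, such scores would serve as the signal a lightweight filter is trained to predict, so that no masking passes are needed at inference.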