Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works propose dynamic vision transformers that accelerate inference by pruning redundant tokens. A key to improving token pruning is to initialize from well-trained models, which gives faster convergence and better performance. However, current base models usually adopt full-image training, i.e., they take full images as input and keep the whole feature maps throughout the forward pass. This is inconsistent with dynamic models, which gradually reduce tokens, in three respects: calculation pattern, information amount, and token selection strategy. Inspired by MAE, which performs a self-supervised masking-and-reconstruction task, we devise masked fine-tuning to bridge the gap between the pre-trained base models used for initialization and token-pruning-based dynamic vision transformers: we mask image patches and predict the image class label from the remaining unmasked patches. Extensive experiments on ImageNet demonstrate that base models obtained via masked fine-tuning gain strong occlusion robustness and resilience to information loss. With this better initialization, Dynamic ViT achieves higher accuracy, especially under large token pruning ratios (e.g., 81.9% vs. 81.3%, and 62.3% vs. 58.9% for DeiT-based Dynamic ViT/0.8 and Dynamic ViT/0.3, respectively). Moreover, we apply our method to different token-pruning-based dynamic vision transformers, different pre-trained models, and randomly initialized models to demonstrate its generalization ability.
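The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how patches are chosen or what mask ratio is used, so uniform random masking, the `mask_ratio` value, the 196-patch count (a 224x224 input with 16x16 patches), and the helper names `sample_kept_patches` and `mask_patches` are all assumptions made for illustration.

```python
import random


def sample_kept_patches(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches that stay visible.

    Assumption: patches are masked uniformly at random; the abstract
    does not specify the masking strategy.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    # Keep at least one patch so the classifier always has some input.
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


def mask_patches(patch_tokens, mask_ratio, rng):
    """Drop the masked patches and return (visible_tokens, kept_indices).

    Only the visible tokens would be passed to the base model, which is
    then trained to predict the class label of the full image.
    """
    keep = sample_kept_patches(len(patch_tokens), mask_ratio, rng)
    return [patch_tokens[i] for i in keep], keep


# Assumed setup: 14 * 14 = 196 patch tokens per image.
rng = random.Random(0)
tokens = [[float(i)] for i in range(196)]  # placeholder 1-d embeddings
visible, keep = mask_patches(tokens, mask_ratio=0.5, rng=rng)
```

In a real fine-tuning loop, the visible tokens (plus the class token and the matching position embeddings) would be fed through the transformer and trained against the image's class label; that part depends on the model code and is left out here.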