We attempt to reduce the computational cost of vision transformers (ViTs), which grows quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet provides improved image recognition performance at various computational costs. The trained ViT model, termed super vision transformer (SuperViT), can process incoming patches of multiple sizes and preserve informative tokens at multiple keeping rates (the ratio of tokens kept). This gives good hardware efficiency at inference, since the available hardware resources often change over time. Experimental results on ImageNet demonstrate that our SuperViT considerably reduces the computational cost of ViT models while even improving performance. For example, we reduce the FLOPs of DeiT-S by 2x while increasing Top-1 accuracy by 0.2%, and at a 1.5x reduction the Top-1 accuracy increases by 0.7%. Our SuperViT also significantly outperforms existing studies on efficient vision transformers. For example, at the same FLOPs, our SuperViT surpasses the recent state-of-the-art (SOTA) EViT by 1.1% when both use DeiT-S as the backbone. The project of this work is made publicly available at https://github.com/lmbxmu/SuperViT.
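The two knobs named above, patch size and keeping rate, both act on the token count, and the quadratic dependence can be illustrated with a little arithmetic. The sketch below is not taken from the SuperViT code. The function names, the 224-pixel input, and the specific patch sizes and keeping rates are assumptions chosen for illustration. It counts only patch tokens (no class token) and models only the quadratic self-attention term, so it does not reproduce the FLOPs figures reported above.

```python
import math


def num_tokens(image_size: int, patch_size: int, keep_rate: float = 1.0) -> int:
    """Patch tokens left after splitting a square image and keeping a fraction of them."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    total = (image_size // patch_size) ** 2
    return math.ceil(keep_rate * total)


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost relative to a baseline, using only the quadratic term."""
    return (tokens / baseline_tokens) ** 2


# Hypothetical settings: 224x224 input, two patch sizes, three keeping rates.
baseline = num_tokens(224, 16)
for patch_size in (16, 32):
    for keep_rate in (1.0, 0.7, 0.5):
        n = num_tokens(224, patch_size, keep_rate)
        cost = relative_attention_cost(n, baseline)
        print(f"patch={patch_size} keep={keep_rate} tokens={n} rel_cost={cost:.3f}")
```

Under these assumptions, halving the token count cuts the quadratic attention term to a quarter, which is why both a larger patch size and a lower keeping rate are effective ways to trade accuracy for compute.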