The vision transformer (ViT) splits each image into a fixed-length sequence of tokens and processes them much as words are processed in natural language processing. Using more tokens typically improves performance, but it also increases computational cost considerably. Motivated by the saying "A picture is worth a thousand words," we propose to accelerate ViT by reducing the number of tokens used to represent each image. Specifically, we introduce a method that adaptively assigns a token length to each image at test time in order to speed up inference. First, we train a Resizable-ViT (ReViT) model that can process inputs with diverse token lengths. Next, we extract token-length labels from ReViT; each label indicates the minimum number of tokens the model needs to predict that image accurately. We then use these labels to train a lightweight Token-Length Assigner (TLA) that allocates the optimal token length to each image during inference. With the TLA, ReViT processes each image using the minimum sufficient number of tokens, which reduces the token count in the ViT model and improves inference speed. Our approach is general and compatible with modern vision transformer architectures, and it significantly reduces computational cost. We verify the effectiveness of our method on multiple representative ViT models for image classification and action recognition.
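The label-extraction and inference steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `token_length_label` and `adaptive_predict`, the dictionary layout of the predictions, and the fallback to the largest token length when no length yields a correct prediction are all assumptions made for this sketch.

```python
from typing import Any, Callable, Dict, Tuple


def token_length_label(preds_by_length: Dict[int, int], true_class: int) -> int:
    """Return the smallest token length whose prediction is correct.

    preds_by_length maps a candidate token length to the class predicted
    by the resizable model at that length. If no length is correct, fall
    back to the largest candidate (an assumption of this sketch; the
    abstract does not say how this case is handled).
    """
    lengths = sorted(preds_by_length)
    for n_tokens in lengths:
        if preds_by_length[n_tokens] == true_class:
            return n_tokens
    return lengths[-1]


def adaptive_predict(
    image: Any,
    assigner: Callable[[Any], int],
    model: Callable[[Any, int], int],
) -> Tuple[int, int]:
    """Let the assigner choose a token length, then run the model with it.

    Both callables are hypothetical stand-ins for the TLA and ReViT.
    Returns (predicted class, token length used).
    """
    n_tokens = assigner(image)
    return model(image, n_tokens), n_tokens
```

For example, with illustrative candidate lengths of 49, 100, and 196 tokens, an image predicted correctly at 100 and 196 tokens but not at 49 would receive the label 100, and the assigner would be trained to output that length for it.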