The input tokens to Vision Transformers (ViT) carry little semantic meaning because they are defined as regular, equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not require as much compute as processing dense, cluttered areas. To address this issue, we propose MSViT, a dynamic mixed-scale tokenization scheme for ViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, so that the number of tokens is determined dynamically per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and can be trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch level. We validate MSViT on classification and segmentation tasks, where it leads to an improved accuracy-complexity trade-off.
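To make the idea of mixed-scale tokenization concrete, the following is a minimal sketch and not the MSViT implementation. It assumes two scales (32-pixel coarse patches and 16-pixel fine patches) and replaces the learned gate with a hypothetical pixel-variance threshold. It also assumes that coarse patches are average-pooled to the fine resolution so that all tokens share one shape. The function name `mixed_scale_tokenize` and its parameters are illustrative only.

```python
import numpy as np


def mixed_scale_tokenize(image, coarse=32, threshold=0.01):
    """Tokenize `image` (H, W, C) at two scales, chosen per coarse region.

    Assumption: a variance threshold stands in for the learned gate.
    Returns a list of (row, col, size, token) tuples. Every token has
    shape (coarse // 2, coarse // 2, C).
    """
    h, w, c = image.shape
    assert h % coarse == 0 and w % coarse == 0, "image must tile evenly"
    fine = coarse // 2
    tokens = []
    for r in range(0, h, coarse):
        for col in range(0, w, coarse):
            patch = image[r:r + coarse, col:col + coarse]
            if patch.var() > threshold:
                # Detailed region: keep four fine tokens.
                for dr in (0, fine):
                    for dc in (0, fine):
                        sub = patch[dr:dr + fine, dc:dc + fine]
                        tokens.append((r + dr, col + dc, fine, sub))
            else:
                # Uniform region: one coarse token, 2x2 average-pooled
                # down to the fine resolution.
                pooled = patch.reshape(fine, 2, fine, 2, c).mean(axis=(1, 3))
                tokens.append((r, col, coarse, pooled))
    return tokens


# Usage: a 64x64 image that is flat except for a noisy top-left quadrant.
rng = np.random.default_rng(0)
demo = np.zeros((64, 64, 3))
demo[:32, :32] = rng.random((32, 32, 3))
demo_tokens = mixed_scale_tokenize(demo)
print(len(demo_tokens))  # 7 tokens instead of the 16 a uniform fine grid gives
```

In this toy setting the token count ranges from 4 (all regions coarse) to 16 (all regions fine) for a 64x64 input, which illustrates how the number of tokens depends on image content. The gate described in the abstract is learned, whereas the variance heuristic here is only a placeholder.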