The input tokens to Vision Transformers carry little semantic meaning because they are defined as regular, equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not require as much compute as processing dense, cluttered areas. To address this issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is dynamically determined per input. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch level. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and can be trained within a few epochs with little training overhead. Furthermore, in contrast to token pruning, MSViT does not lose information about the input and can therefore be readily applied to dense tasks. We validate MSViT on the tasks of classification and segmentation, where it leads to an improved accuracy-complexity trade-off.
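As a rough illustration of mixed-scale tokenization, the sketch below splits an image into coarse regions and either keeps a region as a single token or subdivides it into fine tokens. It is not the MSViT implementation: the abstract does not specify the gate architecture or how coarse tokens are embedded, so the pixel-variance threshold standing in for the learned gate, the 32/16 patch sizes, the average pooling of coarse regions, and the function name `mixed_scale_tokens` are all assumptions made for illustration.

```python
import numpy as np


def mixed_scale_tokens(image, coarse=32, fine=16, threshold=0.01):
    """Tokenize `image` (H, W, C) at two scales.

    Each coarse x coarse region becomes either one token (pooled down to
    fine x fine) or (coarse // fine) ** 2 fine tokens.
    """
    h, w, c = image.shape
    assert h % coarse == 0 and w % coarse == 0 and coarse % fine == 0
    ratio = coarse // fine
    tokens, scales = [], []
    for y in range(0, h, coarse):
        for x in range(0, w, coarse):
            region = image[y:y + coarse, x:x + coarse]
            # ASSUMPTION: pixel variance stands in for the learned gate,
            # which the abstract describes but does not specify.
            if region.var() > threshold:
                # Detailed region: split into fine patches.
                for dy in range(0, coarse, fine):
                    for dx in range(0, coarse, fine):
                        patch = region[dy:dy + fine, dx:dx + fine]
                        tokens.append(patch.reshape(-1))
                        scales.append(fine)
            else:
                # Uniform region: one token, average-pooled so that all
                # tokens share the same dimension (also an assumption).
                pooled = region.reshape(fine, ratio, fine, ratio, c).mean(axis=(1, 3))
                tokens.append(pooled.reshape(-1))
                scales.append(coarse)
    return np.stack(tokens), np.array(scales)


# Usage: a 64x64 image whose top-left quadrant is noisy, the rest uniform.
rng = np.random.default_rng(0)
img = np.zeros((64, 64, 3))
img[:32, :32] = rng.random((32, 32, 3))
tokens, scales = mixed_scale_tokens(img)
print(tokens.shape[0], "tokens instead of", (64 // 16) ** 2)
```

Because every pixel is covered by exactly one token, no region of the input is discarded, which is the property that distinguishes this kind of scheme from token pruning; only the token count varies with image content.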