Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff: smaller patches lead to higher accuracy at greater computational cost. Changing the patch size, however, typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation. We conclude that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision.
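To make the speed/accuracy tradeoff concrete, the following is a minimal sketch of how the patch size determines the sequence length, and of drawing a random patch size at each training step. It is an illustration under stated assumptions, not the implementation in the big_vision repository: the image size of 240, the list of patch sizes, uniform sampling, and the helper names `num_tokens` and `sample_patch_size` are all hypothetical choices made for this example.

```python
import random

# Assumed values for illustration only; the abstract does not specify them.
IMAGE_SIZE = 240
PATCH_SIZES = (8, 10, 12, 15, 16, 20, 24, 30, 40, 48)  # each divides 240 evenly


def num_tokens(image_size, patch_size):
    """Number of patch tokens for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    per_side = image_size // patch_size
    return per_side * per_side


def sample_patch_size(rng, patch_sizes=PATCH_SIZES):
    """Draw one patch size uniformly at random (hypothetical sampling rule)."""
    return rng.choice(patch_sizes)


if __name__ == "__main__":
    rng = random.Random(0)
    # One patch size per training step: the same weights would see
    # sequences of very different lengths over the course of training.
    for step in range(5):
        p = sample_patch_size(rng)
        print(f"step {step}: patch size {p}, {num_tokens(IMAGE_SIZE, p)} tokens")
```

Under these assumptions, halving the patch side quadruples the number of tokens, which is why smaller patches cost more compute: a 240-pixel image yields 225 tokens at patch size 16 but 900 tokens at patch size 8.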