The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence support varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation, and it leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed input and modeling pipeline used by most computer vision models, and represents a promising direction for ViTs.
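To make the idea of sequence packing concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a patch size of 16, rounds each image side up to a whole number of patches, and uses a simple first-fit greedy packer; the function names (`num_tokens`, `pack_greedy`, `segment_ids`, `attention_mask`) and the masking scheme are hypothetical choices made for illustration. The sketch shows how images of different resolutions and aspect ratios yield token sequences of different lengths, how several such sequences can share one fixed-length training sequence, and how a mask can keep each image from attending to its neighbours.

```python
from math import ceil


def num_tokens(height, width, patch=16):
    """Patch-token count for an image kept at its native resolution.

    Assumption: each side is rounded up to a whole number of patches.
    """
    return ceil(height / patch) * ceil(width / patch)


def pack_greedy(token_counts, max_len):
    """First-fit packing (an illustrative choice of packing heuristic).

    Returns a list of sequences; each sequence is a list of
    (example_index, n_tokens) pairs whose token counts sum to <= max_len.
    """
    sequences = []  # packed sequences built so far
    free = []       # remaining capacity of each packed sequence
    for idx, n in enumerate(token_counts):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, more than max_len={max_len}")
        for s, room in enumerate(free):
            if n <= room:
                sequences[s].append((idx, n))
                free[s] -= n
                break
        else:
            # No existing sequence has room: open a new one.
            sequences.append([(idx, n)])
            free.append(max_len - n)
    return sequences


def segment_ids(sequence, max_len):
    """Per-token example index for one packed sequence, with -1 marking padding."""
    ids = []
    for idx, n in sequence:
        ids.extend([idx] * n)
    ids.extend([-1] * (max_len - len(ids)))
    return ids


def attention_mask(ids):
    """mask[i][j] is True only if tokens i and j belong to the same (non-padding) example."""
    return [[a == b and a != -1 for b in ids] for a in ids]


# Four images with different resolutions and aspect ratios (height, width).
images = [(64, 64), (32, 96), (48, 16), (80, 48)]
counts = [num_tokens(h, w) for h, w in images]
packed = pack_greedy(counts, max_len=32)
ids = segment_ids(packed[0], max_len=32)
mask = attention_mask(ids)
```

With these inputs the four images produce 16, 12, 3, and 15 tokens. The first three fit into one sequence of length 32 with a single padding token, and the fourth opens a second sequence. A fixed-resolution pipeline would instead resize every image to the same shape, which is the practice the abstract argues against.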