Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem has been overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution, such as 224x224, for efficiency during training and inference. However, a uniform input size conflicts with real-world scenarios, where images naturally vary in resolution. Changing a model's preset resolution can severely degrade its performance. In this work, we propose to enhance the model's adaptability to resolution variation by optimizing the patch embedding. The proposed method, called Multi-Scale Patch Embedding (MSPE), substitutes the standard patch embedding with multiple variable-sized patch kernels and selects the best parameters for each resolution, eliminating the need to resize the original image. Our method requires neither high-cost training nor modifications to other model components, making it easy to apply to most ViT models. Experiments on image classification, segmentation, and detection tasks demonstrate the effectiveness of MSPE, which yields superior performance on low-resolution inputs and performs comparably to existing methods on high-resolution inputs.
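To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a multi-scale patch embedding: one projection per candidate patch size, with the kernel chosen so the token count stays close to a fixed target regardless of input resolution. All names, sizes, and the kernel-selection rule here are illustrative assumptions; ViT patch embedding with non-overlapping patches is equivalent to the reshape-and-project shown below.

```python
import numpy as np

def patchify(img, p):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # flattened to shape (num_patches, p*p*C).
    H, W, C = img.shape
    gh, gw = H // p, W // p
    x = img[:gh * p, :gw * p].reshape(gh, p, gw, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, p * p * C)

class MultiScalePatchEmbed:
    """Illustrative sketch: maintain one projection matrix per candidate
    patch size and pick the kernel best matched to the input resolution."""

    def __init__(self, patch_sizes=(8, 16, 32), dim=64, target_tokens=196, seed=0):
        rng = np.random.default_rng(seed)
        self.target_tokens = target_tokens  # e.g. 14x14 tokens, as for 224px ViT
        # One learnable projection per patch size (random here for illustration).
        self.proj = {p: rng.standard_normal((p * p * 3, dim)) * 0.02
                     for p in patch_sizes}

    def __call__(self, img):
        H, W, _ = img.shape
        # Hypothetical selection rule: choose the patch size whose resulting
        # token count is closest to the target, so no image resizing is needed.
        best = min(self.proj, key=lambda p: abs((H // p) * (W // p) - self.target_tokens))
        tokens = patchify(img, best) @ self.proj[best]
        return best, tokens

embed = MultiScalePatchEmbed()
p_small, t_small = embed(np.zeros((112, 112, 3)))  # low-res input -> smaller kernel
p_large, t_large = embed(np.zeros((224, 224, 3)))  # standard input -> 16px kernel
```

Under this rule, a 112x112 input selects the 8-pixel kernel and a 224x224 input the 16-pixel kernel, and both produce the same 196-token sequence, which is what lets the rest of the ViT stay unmodified.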