Image resizing is a fundamental preprocessing step in modern computer vision. Throughout the deep learning revolution, researchers have largely overlooked alternatives to the readily available resizers in common use, such as nearest-neighbor, bilinear, and bicubic interpolation. The key question we address is whether the front-end resizer affects the performance of deep vision models. In this paper, we present MULLER, an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters. MULLER is bandpass in nature: it learns to boost details in the frequency subbands that benefit the downstream recognition model. We show that MULLER can be easily plugged into various training pipelines and that it improves the performance of the underlying vision task at little to no extra cost. Specifically, we select a state-of-the-art vision Transformer, MaxViT, as the baseline and show that, compared with the standard training scheme, MaxViT trained with MULLER gains up to 0.6% top-1 accuracy on ImageNet-1k, or reaches similar top-1 accuracy with 36% lower inference cost. Notably, the gains from MULLER also scale with model size and with training data size (e.g., ImageNet-21k and JFT), and MULLER is widely applicable to multiple vision tasks, including image classification, object detection and segmentation, and image quality assessment.
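To make the idea of a multilayer Laplacian resizer concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a grayscale image stored as a list of rows, a 3x3 box blur standing in for the low-pass filter, bilinear interpolation as the base resizer, and one hypothetical trainable pair `(alpha, beta)` per layer passed through a `tanh`; all function names and these specific choices are illustrative assumptions. Each layer extracts a bandpass detail image (the current image minus its blurred version), resizes it, and adds a scaled version to the base resized output.

```python
import math


def blur(img):
    """3x3 box blur with edge replication (assumed stand-in for a low-pass filter)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out


def resize_bilinear(img, out_h, out_w):
    """Bilinear resize with pixel-center alignment."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        sy = min(max((i + 0.5) * h / out_h - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for j in range(out_w):
            sx = min(max((j + 0.5) * w / out_w - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def laplacian_resize(img, out_h, out_w, params):
    """Base resize plus one learned detail term per layer.

    params holds one hypothetical (alpha, beta) pair per layer; in a real
    pipeline these would be trained jointly with the downstream model.
    """
    out = resize_bilinear(img, out_h, out_w)
    current = img
    for alpha, beta in params:
        low = blur(current)
        # Bandpass detail: what the blur removed at this scale.
        detail = [[c - l for c, l in zip(c_row, l_row)]
                  for c_row, l_row in zip(current, low)]
        detail_small = resize_bilinear(detail, out_h, out_w)
        for y in range(out_h):
            for x in range(out_w):
                out[y][x] += math.tanh(alpha * detail_small[y][x] + beta)
        # The next layer works on the blurred image, i.e. a lower subband.
        current = low
    return out


# Usage: downsample a 4x4 vertical step edge to 2x2 with one detail layer.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
plain = resize_bilinear(step, 2, 2)
boosted = laplacian_resize(step, 2, 2, [(2.0, 0.0)])
print(plain[0], boosted[0])
```

With every `(alpha, beta)` set to zero the sketch reduces to plain bilinear resizing, which illustrates why such a resizer adds only two scalars per layer on top of a standard one; a positive `alpha` increases the contrast across the edge in the resized output.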