Multi-scale image resolution is a de facto standard approach in modern object detectors such as DETR. It lets a detector capture information at different scales by processing images at multiple resolutions. However, the resolutions are chosen manually as hyperparameters informed by prior knowledge, which restricts flexibility and requires human intervention. This work introduces Elastic-DETR, a novel strategy for learnable resolution that enables elastic use of multiple image resolutions. Our network predicts an adaptive scale factor from the content of each image with a compact scale prediction module (< 2 GFLOPs). The key question our method addresses is how to determine the resolution without prior knowledge. We identify the key components of resolution optimization and derive two loss functions from them: a scale loss, which increases adaptiveness to the individual image, and a distribution loss, which determines the overall degree of scaling based on network performance. By exploiting this flexibility in resolution, we obtain a range of models with different trade-offs between accuracy and computational complexity. We show empirically that our scheme can unleash the potential of a wide spectrum of image resolutions without constraining flexibility. On MS COCO, our models achieve up to a 3.5 percentage-point accuracy gain or a 26% reduction in computation compared with multi-scale (MS) trained DN-DETR.
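To make the idea of a content-dependent scale factor concrete, the following is a minimal sketch and not the Elastic-DETR implementation. It assumes a hypothetical predictor that emits one unbounded score per image, squashes that score into an assumed range of scale factors, and derives the resized image shape. The function names, the sigmoid mapping, the bounds `s_min` and `s_max`, and the quadratic pixel-count proxy for cost are all illustrative assumptions that the abstract does not specify.

```python
import math


def bounded_scale(raw_score, s_min=0.5, s_max=2.0):
    """Map an unbounded score to a scale factor in [s_min, s_max].

    Assumption: a sigmoid gate and these bounds are illustrative choices,
    not values taken from the paper.
    """
    gate = 1.0 / (1.0 + math.exp(-raw_score))
    return s_min + (s_max - s_min) * gate


def resized_shape(height, width, scale):
    """Return the (height, width) of an image after isotropic rescaling."""
    return max(1, round(height * scale)), max(1, round(width * scale))


def relative_pixel_count(scale):
    """Pixel count relative to the unscaled image.

    Assumption: used here only as a rough proxy for how compute grows
    with resolution; actual detector cost depends on the architecture.
    """
    return scale ** 2


if __name__ == "__main__":
    scale = bounded_scale(0.0)  # hypothetical predictor output for one image
    print(scale, resized_shape(480, 640, scale), relative_pixel_count(scale))
```

Under these assumptions, an image that the predictor scores low is processed at a smaller resolution, and one that it scores high is enlarged, which is the kind of per-image trade-off between accuracy and computation that the abstract describes.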