Multi-scale features have proven highly effective for object detection, but they often incur large, even prohibitive, extra computation costs, especially in recent Transformer-based detectors. In this paper, we propose Iterative Multi-scale Feature Aggregation (IMFA), a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, which is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that IMFA significantly boosts the performance of multiple Transformer-based object detectors with only slight computational overhead.
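The sparse, scale-adaptive sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single-channel feature maps, the fixed keypoint layout (box centre plus edge midpoints), and the hand-supplied scale scores are all assumptions made for illustration. In a real detector the keypoints and scale weights would presumably be predicted by the network, and each level would carry a feature vector rather than a scalar.

```python
import math


def bilinear_sample(fmap, x, y):
    """Bilinearly sample a 2D grid at normalized coordinates (x, y) in [0, 1]."""
    h, w = len(fmap), len(fmap[0])
    # Map normalized coordinates onto the index range [0, w-1] x [0, h-1].
    px = min(max(x, 0.0), 1.0) * (w - 1)
    py = min(max(y, 0.0), 1.0) * (h - 1)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    top = fmap[y0][x0] * (1 - ax) + fmap[y0][x1] * ax
    bottom = fmap[y1][x0] * (1 - ax) + fmap[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def box_keypoints(box):
    """Hypothetical keypoint layout: centre and edge midpoints of (cx, cy, w, h)."""
    cx, cy, w, h = box
    return [
        (cx, cy),
        (cx - w / 2, cy),
        (cx + w / 2, cy),
        (cx, cy - h / 2),
        (cx, cy + h / 2),
    ]


def sample_scale_adaptive(pyramid, keypoints, scale_scores):
    """Return one feature per keypoint: a softmax-weighted mix over pyramid levels.

    Only len(keypoints) * len(pyramid) locations are read, instead of every
    position of every level, which is where the sparsity comes from.
    """
    weights = softmax(scale_scores)
    features = []
    for x, y in keypoints:
        per_level = [bilinear_sample(level, x, y) for level in pyramid]
        features.append(sum(wt * f for wt, f in zip(weights, per_level)))
    return features


# Usage: a toy two-level pyramid and one box from a previous detection round.
pyramid = [
    [[float(r * 4 + c) for c in range(4)] for r in range(4)],  # fine level, 4x4
    [[10.0, 20.0], [30.0, 40.0]],                              # coarse level, 2x2
]
prior_box = (0.5, 0.5, 0.4, 0.4)  # normalized (cx, cy, w, h), assumed given
sparse_features = sample_scale_adaptive(pyramid, box_keypoints(prior_box), [0.0, 1.0])
```

In an iterative scheme of the kind the abstract describes, the sampled features would be fed back to the decoder, the refined boxes would yield new keypoints, and the sampling would repeat; that loop is omitted here because it depends on learned components.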