The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and/or pyramid designs remain a key factor in their empirical success. In this paper, we show that this reliance on either feature pyramids or a hierarchical backbone is unnecessary: a transformer-based detector with scale-aware attention enables the plain detector 'SimPLR', whose backbone and detection head are both non-hierarchical and operate on single-scale features. Our experiments show that SimPLR with scale-aware attention is plain and simple, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state of the art, our model scales much better with larger-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, and panoptic segmentation. Code will be released.