The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and/or pyramid designs remain a key factor in their empirical success. In this paper, we show that this reliance on either feature pyramids or a hierarchical backbone is unnecessary: a transformer-based detector with scale-aware attention enables a plain detector, SimPLR, whose backbone and detection head are both non-hierarchical and operate on single-scale features. The plain architecture allows SimPLR to take advantage of self-supervised learning and scaling approaches with ViTs, yielding competitive performance compared to hierarchical and multi-scale counterparts. Our experiments demonstrate that, when scaling to larger ViT backbones, SimPLR achieves better performance than end-to-end segmentation models (Mask2Former) and plain-backbone detectors (ViTDet), while consistently being faster. The code will be released.