Object localization in general environments is a fundamental part of vision systems. While dominating the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to estimate object bounding boxes accurately in complex environments. We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially better-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state of the art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, in some cases by over 10 mAP. The improvements under stringent quality requirements are even more pronounced. Our code and models will be released at https://github.com/SysCV/cascade-detr.
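To make the two ideas in the abstract concrete, the sketch below illustrates (a) restricting a query's cross-attention to the feature-map cells covered by its previous box prediction and (b) computing an IoU value that could serve as a regression target for the query score. It is a minimal pure-Python illustration, not the released Cascade-DETR code: the function names, the normalized `(x0, y0, x1, y1)` box format, the cell-centre inclusion test, and the fallback for degenerate boxes are all assumptions made here.

```python
import math


def box_attention_mask(box, height, width):
    """Row-major booleans: True where a feature-map cell centre lies inside
    `box` = (x0, y0, x1, y1), given in normalized [0, 1] coordinates.
    (Hypothetical helper; the box format is an assumption.)"""
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        cy = (row + 0.5) / height
        for col in range(width):
            cx = (col + 0.5) / width
            mask.append(x0 <= cx <= x1 and y0 <= cy <= y1)
    return mask


def masked_softmax(logits, mask):
    """Softmax restricted to positions where mask is True; all other
    positions receive zero attention weight."""
    kept = [l for l, m in zip(logits, mask) if m]
    if not kept:
        # Assumed fallback: a box covering no cell centre attends everywhere.
        mask = [True] * len(logits)
        kept = list(logits)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Previous-layer box prediction for one query on a 4x4 feature map.
prev_box = (0.0, 0.0, 0.5, 0.5)
mask = box_attention_mask(prev_box, 4, 4)

# Uniform logits: attention mass is spread only over cells inside the box.
weights = masked_softmax([0.0] * 16, mask)

# IoU between the predicted box and a (hypothetical) ground-truth box; a
# value like this could supervise the predicted query score.
target_iou = box_iou(prev_box, (0.25, 0.25, 0.75, 0.75))
```

In a multi-layer decoder, the mask would be recomputed at each layer from that layer's input box, so the attended region tightens as the box estimate is refined. Scoring queries by a predicted IoU rather than a classification score ties the confidence to localization quality, which is the calibration effect the abstract describes.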