Transformer-based methods have recently demonstrated superior performance in monocular 3D object detection, which aims to predict 3D attributes from a single 2D image. Most existing transformer-based methods leverage both visual and depth representations to explore valuable query points on objects, and the quality of the learned query points has a great impact on detection accuracy. Unfortunately, existing unsupervised attention mechanisms in transformers are prone to generating low-quality query features because of inaccurate receptive fields, especially on hard objects. To tackle this problem, this paper proposes a novel Supervised Scale-aware Deformable Attention (SSDA) for monocular 3D object detection. Specifically, SSDA presets several masks with different scales and uses depth and visual features to adaptively learn a scale-aware filter for object query augmentation. With this scale awareness, SSDA can predict an accurate receptive field for each object query, which supports robust query feature generation. In addition, SSDA is trained with a Weighted Scale Matching (WSM) loss that supervises scale prediction, yielding more confident results than unsupervised attention mechanisms. Extensive experiments on the KITTI benchmark demonstrate that SSDA significantly improves detection accuracy, especially on moderate and hard objects, achieving state-of-the-art performance compared with existing approaches. Our code will be made publicly available at https://github.com/mikasa3lili/SSD-MonoDETR.
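The idea of blending preset multi-scale masks into a single scale-aware filter, and of supervising the predicted scale, can be illustrated with a small sketch. This is not the paper's implementation: the square mask shape, the radii, the softmax blending, and the function names (`square_mask`, `scale_aware_filter`, `weighted_scale_loss`) are all assumptions made for illustration. In particular, the abstract does not give the WSM loss formulation, so a plain weighted cross-entropy stands in for it here, and the scale logits that SSDA would learn from depth and visual features are passed in as hypothetical inputs.

```python
import math


def square_mask(size, radius):
    """Binary size x size mask with ones within `radius` cells of the center.

    Assumption: square (Chebyshev-distance) masks; the paper's mask shape may differ.
    """
    c = size // 2
    return [[1.0 if max(abs(i - c), abs(j - c)) <= radius else 0.0
             for j in range(size)] for i in range(size)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scale_aware_filter(scale_logits, size=7, radii=(0, 1, 2, 3)):
    """Blend preset masks with softmax weights over per-scale logits.

    `scale_logits` is a hypothetical stand-in for scores predicted from
    depth and visual features; one logit per preset radius.
    """
    weights = softmax(scale_logits)
    masks = [square_mask(size, r) for r in radii]
    return [[sum(w * m[i][j] for w, m in zip(weights, masks))
             for j in range(size)] for i in range(size)]


def weighted_scale_loss(scale_logits, target_index, class_weights):
    """Weighted cross-entropy on the predicted scale.

    Assumption: a generic stand-in for a supervised scale loss,
    not the WSM loss defined in the paper.
    """
    probs = softmax(scale_logits)
    return -class_weights[target_index] * math.log(probs[target_index])
```

With uniform logits, every preset mask contributes equally, so the filter value at the center is the sum of all weights while the outermost cells receive only the weight of the largest mask. Sharper logits would concentrate the filter on a single scale, which is the behavior a scale-supervision loss would encourage.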