Transformer-based methods have recently demonstrated superior performance in monocular 3D object detection, which aims to predict 3D attributes from a single 2D image. Most existing transformer-based methods leverage both visual and depth representations to explore valuable query points on objects, and the quality of the learned query points has a great impact on detection accuracy. Unfortunately, existing unsupervised attention mechanisms in transformers are prone to generating low-quality query features due to inaccurate receptive fields, especially on hard objects. To tackle this problem, this paper proposes a novel Supervised Scale-aware Deformable Attention (SSDA) for monocular 3D object detection. Specifically, SSDA presets several masks with different scales and utilizes depth and visual features to adaptively learn a scale-aware filter for object query augmentation. With this scale awareness, SSDA can accurately predict the receptive field of an object query to support robust query feature generation. In addition, SSDA is trained with a Weighted Scale Matching (WSM) loss that supervises scale prediction, which yields more confident results than unsupervised attention mechanisms. Extensive experiments on the KITTI benchmark demonstrate that SSDA significantly improves detection accuracy, especially on moderate and hard objects, yielding state-of-the-art performance compared with existing approaches. Our code will be made publicly available at https://github.com/mikasa3lili/SSD-MonoDETR.
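The scale-aware idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the mask shapes, the form of the scale-aware filter, or the WSM loss. The sketch below assumes square masks, a single linear layer (`w_scale`, hypothetical) that maps concatenated visual and depth features to per-scale probabilities, and a plain weighted cross-entropy as a stand-in for the WSM loss. All function and parameter names are illustrative.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def local_mean(feature_map, center, half_size):
    """Average the features inside a square mask of side 2*half_size+1.

    The mask is clipped at the borders of the (H, W, C) feature map.
    """
    h, w, _ = feature_map.shape
    r, c = center
    r0, r1 = max(r - half_size, 0), min(r + half_size + 1, h)
    c0, c1 = max(c - half_size, 0), min(c + half_size + 1, w)
    return feature_map[r0:r1, c0:c1].mean(axis=(0, 1))


def scale_aware_query(visual, depth, center, half_sizes, w_scale):
    """Return (augmented query feature, per-scale probabilities).

    Assumption: the scale probabilities come from one linear map applied
    to the visual and depth features at the query location.
    """
    # One pooled feature per preset mask scale: shape (S, C).
    pooled = np.stack([local_mean(visual, center, s) for s in half_sizes])
    r, c = center
    joint = np.concatenate([visual[r, c], depth[r, c]])
    probs = softmax(w_scale @ joint)  # shape (S,)
    # Scale-weighted combination of the pooled features: shape (C,).
    return probs @ pooled, probs


def weighted_scale_loss(probs, target_scale, weight=1.0):
    """Stand-in for the WSM loss: weighted cross-entropy on the scale.

    Assumption: the target is the index of the preset scale that best
    matches the ground-truth object; the paper's formulation may differ.
    """
    return -weight * np.log(probs[target_scale])
```

For example, with an 8x8 map, three preset scales, and a zero-initialized `w_scale`, the predicted scale distribution is uniform, and the loss penalizes it until supervision concentrates probability on the matching scale:

```python
visual = np.ones((8, 8, 4))
depth = np.ones((8, 8, 2))
query, probs = scale_aware_query(
    visual, depth, center=(0, 0), half_sizes=[1, 2, 3],
    w_scale=np.zeros((3, 6)),
)
loss = weighted_scale_loss(probs, target_scale=0, weight=2.0)
```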