Transformer-based methods have recently demonstrated superior performance in monocular 3D object detection, which predicts 3D attributes from a single 2D image. Most existing transformer-based methods leverage visual and depth representations to explore valuable query points on objects, and the quality of the learned queries has a great impact on detection accuracy. Unfortunately, existing unsupervised attention mechanisms in transformers are prone to generating low-quality query features due to inaccurate receptive fields, especially on hard objects. To tackle this problem, this paper proposes a novel ``Supervised Scale-constrained Deformable Attention'' (SSDA) for monocular 3D object detection. Specifically, SSDA presets several masks with different scales and uses depth and visual features to predict the local feature for each query. By imposing this scale constraint, SSDA can predict an accurate receptive field for each query, which supports robust query feature generation. Moreover, SSDA is trained with a Weighted Scale Matching (WSM) loss that supervises scale prediction, yielding more confident results than unsupervised attention mechanisms. Extensive experiments on the KITTI dataset demonstrate that SSDA significantly improves detection accuracy, especially on moderate and hard objects, and achieves state-of-the-art performance compared with existing approaches. Code will be publicly available at https://github.com/mikasa3lili/SSD-MonoDETR.
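The scale-constrained idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the mask shapes, how the scale scores are produced, or the form of the WSM loss. The sketch assumes square masks centred on the query point, takes the per-scale scores (`scale_logits`) as given inputs that a learned head over depth and visual features would presumably produce, and uses a weighted cross-entropy as a placeholder for the WSM loss. All function and parameter names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_mean(feature_map, cx, cy, size):
    """Average the features inside a size x size window centred on (cx, cy).

    feature_map is a nested list indexed as [row][col][channel].
    The window is clipped to the map borders (an assumption of this sketch).
    """
    h, w = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    half = size // 2
    acc = [0.0] * channels
    count = 0
    for r in range(max(0, cy - half), min(h, cy + half + 1)):
        for c in range(max(0, cx - half), min(w, cx + half + 1)):
            for k in range(channels):
                acc[k] += feature_map[r][c][k]
            count += 1
    return [a / count for a in acc]


def scale_constrained_feature(feature_map, cx, cy, scale_logits, scales=(1, 3, 5)):
    """Blend per-scale local features using predicted scale weights.

    scale_logits: one score per preset scale (assumed to come from a learned
    head; here they are simply passed in).
    Returns the blended query feature and the scale weights.
    """
    weights = softmax(scale_logits)
    pooled = [masked_mean(feature_map, cx, cy, s) for s in scales]
    channels = len(pooled[0])
    feature = [sum(w * p[k] for w, p in zip(weights, pooled)) for k in range(channels)]
    return feature, weights


def placeholder_scale_loss(weights, target_scale_index, sample_weight=1.0):
    """Weighted cross-entropy on the scale distribution.

    This is a stand-in for the WSM loss, whose actual definition is not
    given in the abstract.
    """
    return -sample_weight * math.log(weights[target_scale_index] + 1e-12)


if __name__ == "__main__":
    # Toy 5x5 single-channel map; the query sits at the centre.
    fmap = [[[float(r * 5 + c)] for c in range(5)] for r in range(5)]
    feat, w = scale_constrained_feature(fmap, cx=2, cy=2, scale_logits=[0.0, 0.0, 0.0])
    print(feat, w, placeholder_scale_loss(w, target_scale_index=1))
```

Restricting each query to a small set of preset scales turns receptive-field selection into a classification-like problem, which is what makes direct supervision of the scale prediction possible in the first place.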