FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

Accurately detecting 3D objects from multi-view 2D images is a challenging yet essential task in autonomous driving. Current methods integrate depth prediction to recover spatial information for object query decoding, which requires explicit supervision from LiDAR points during training. However, the quality of the predicted depth remains unsatisfactory: depth is discontinuous at object boundaries and small objects are hard to distinguish. These problems stem mainly from the sparse supervision provided by projected LiDAR points and from the use of high-level image features for depth prediction. In addition, previous methods overlook cross-view consistency and scale invariance. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE), which equips 2D image features with spatial information for the 3D detection transformer decoder and is obtained through three main modules. First, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge cues and low-frequency semantics from different feature levels. Next, the Cross-view Scale-invariant Depth Predictor (CSDP) estimates a pixel-level depth distribution using cross-view and efficient channel attention mechanisms. Finally, the Positional Depth Encoder (PDE) combines the 2D image features with 3D position embeddings to generate 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted so that depth is learned in a complementary way from both the metric and the distribution perspectives. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and superiority of the proposed method.
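To make the idea of hybrid depth supervision over a pixel-level depth distribution more concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify the loss terms, so everything here is an assumption chosen for illustration: uniform depth bins, an L1 term on the expected depth (the "metric" aspect), a cross-entropy term on the bin that contains the LiDAR depth (the "distribution" aspect), and the hypothetical names `hybrid_depth_loss`, `w_metric`, and `w_dist`. Pixels without a projected LiDAR point are skipped, which mirrors the sparse supervision the abstract describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's depth-bin logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(d_min, d_max, num_bins):
    """Centers of uniformly spaced depth bins (uniform binning is an assumption)."""
    width = (d_max - d_min) / num_bins
    return [d_min + (i + 0.5) * width for i in range(num_bins)]


def hybrid_depth_loss(logits, gt_depth, d_min, d_max, w_metric=1.0, w_dist=1.0):
    """Average hybrid loss over pixels that have a LiDAR depth.

    logits:   list of per-pixel lists of depth-bin logits
    gt_depth: list of LiDAR depths, with None where no point projects
    The L1 and cross-entropy terms are illustrative choices, not the
    paper's confirmed formulation.
    """
    num_bins = len(logits[0])
    width = (d_max - d_min) / num_bins
    centers = bin_centers(d_min, d_max, num_bins)

    total, count = 0.0, 0
    for pixel_logits, gt in zip(logits, gt_depth):
        # Sparse supervision: skip pixels with no (or out-of-range) LiDAR depth.
        if gt is None or not (d_min <= gt < d_max):
            continue
        probs = softmax(pixel_logits)

        # Metric term: L1 distance between expected depth and LiDAR depth.
        expected = sum(p * c for p, c in zip(probs, centers))
        metric = abs(expected - gt)

        # Distribution term: cross-entropy against the bin containing gt.
        target_bin = min(int((gt - d_min) / width), num_bins - 1)
        dist = -math.log(probs[target_bin] + 1e-12)

        total += w_metric * metric + w_dist * dist
        count += 1
    return total / count if count else 0.0


# Two pixels, four depth bins over [0, 4) metres; the second pixel has no LiDAR hit.
example_logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
example_gt = [2.0, None]
example_loss = hybrid_depth_loss(example_logits, example_gt, d_min=0.0, d_max=4.0)
```

In this toy case the first pixel's uniform distribution has an expected depth equal to the LiDAR depth, so only the distribution term contributes. That is the sense in which two such terms can be complementary: a prediction can be right on average while still being spread too widely across bins.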
