Transformer-based methods have swept the benchmarks for 2D and 3D detection on images. Because tokenization before the attention mechanism discards spatial information, positional encoding is critical for these methods. Recent work has found that encodings based on samples along the 3D viewing rays can significantly improve the quality of multi-camera 3D object detection. We hypothesize that 3D point locations provide more information than rays. We therefore introduce 3D point positional encoding (3DPPE) to the 3D detection Transformer decoder. Although 3D measurements are not available at inference time in monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions. Our hybrid-depth module combines direct and categorical depth to estimate a refined depth for each pixel. Despite the approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset, significantly outperforming encodings based on ray samples. The code is available at https://github.com/drilistbox/3DPPE.
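To make the idea of a point-based encoding concrete, the following is a minimal sketch and not the released implementation. It assumes a pinhole camera, a fixed blending weight `alpha` between the direct and categorical depth estimates, and a plain sinusoidal embedding of the resulting 3D point. The function names, the blending rule, and the number of frequencies are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def hybrid_depth(direct_depth, bin_logits, bin_centers, alpha=0.5):
    """Blend a directly regressed depth with a categorical depth estimate.

    Assumption: the categorical depth is the expectation of the bin centers
    under a softmax over the logits, and the two estimates are mixed with a
    scalar weight alpha.
    """
    peak = max(bin_logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(logit - peak) for logit in bin_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    categorical_depth = sum(p * c for p, c in zip(probs, bin_centers))
    return alpha * direct_depth + (1.0 - alpha) * categorical_depth


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def sinusoidal_encoding(point, num_freqs=4):
    """Encode each coordinate with sin/cos at frequencies 1, 2, 4, ... (assumed)."""
    features = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


# Usage: one pixel, illustrative intrinsics, and two depth bins.
depth = hybrid_depth(direct_depth=10.0, bin_logits=[0.0, 0.0], bin_centers=[8.0, 12.0])
point = backproject(u=700.0, v=400.0, depth=depth, fx=500.0, fy=500.0, cx=600.0, cy=350.0)
encoding = sinusoidal_encoding(point)
```

In contrast to a ray-based encoding, which would embed several depth samples along the viewing ray of each pixel, this sketch embeds a single point per pixel, so the quality of the encoding depends directly on the quality of the predicted depth.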