Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

Training high-accuracy 3D detectors requires large numbers of 3D annotations with 7 degrees of freedom, which are laborious and time-consuming to produce. Point annotations offer a promising alternative for practical 3D detection: they are easier and cheaper to obtain, yet still provide strong spatial information for object localization. In this paper, we empirically find that simply adapting Point-DETR to 3D is non-trivial and runs into two main bottlenecks: 1) it fails to encode a strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions because LiDAR points there are extremely sparse. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to make full use of point-wise supervision within a constrained instance-wise annotation budget. Unlike Point-DETR, which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to strengthen the positional prior. To address the low quality of the teacher model's pseudo labels in distant regions, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI). Moreover, we propose a point-guided self-supervised learning technique that allows point priors to be fully exploited even in student models. Extensive experiments on the representative nuScenes dataset demonstrate that Point-DETR3D obtains significant improvements over previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% of the performance of its fully supervised counterpart.
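
To make the idea of explicit positional query initialization more concrete, the following is a minimal sketch in which each annotated 3D point is turned directly into a query embedding, rather than being passed only through a learned point encoder. The abstract does not specify the encoding, so the sine/cosine scheme, the function names `sinusoidal_embedding` and `init_queries`, and the parameters `num_freqs` and `temperature` are all illustrative assumptions, not the authors' implementation.

```python
import math


def sinusoidal_embedding(point, num_freqs=4, temperature=10000.0):
    """Encode one 3D point (x, y, z) as a fixed sine/cosine feature vector.

    Assumption: a generic sinusoidal positional encoding of the kind used
    in many DETR-style detectors; the paper's actual encoding may differ.
    """
    features = []
    for coord in point:
        for k in range(num_freqs):
            # Geometrically spaced frequencies, from 1 down toward 1/temperature.
            freq = 1.0 / (temperature ** (k / num_freqs))
            features.append(math.sin(coord * freq))
            features.append(math.cos(coord * freq))
    return features


def init_queries(point_annotations, num_freqs=4):
    """Build one positional query per point annotation (hypothetical helper)."""
    return [sinusoidal_embedding(p, num_freqs) for p in point_annotations]


# Two hypothetical point annotations in metres (x, y, z), one near, one far.
points = [(12.5, -3.0, 0.8), (48.0, 10.2, 1.1)]
queries = init_queries(points)
print(len(queries), len(queries[0]))
```

In this sketch every query carries its annotated location explicitly from the start, so a decoder would not have to rediscover where the object is; in a real detector the vector would typically be projected to the model dimension by a learned layer.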
