Recently, sparse 3D convolutions have changed 3D object detection. 3D CNNs perform on par with voting-based approaches while being memory-efficient and scaling better to large scenes. However, there is still room for improvement. Taking a deliberate, practice-oriented approach to problem-solving, we analyze the performance of such methods and localize their weaknesses. Resolving the identified issues one by one, we arrive at TR3D: a fast, fully convolutional 3D object detection model trained end-to-end that achieves state-of-the-art results on the standard benchmarks ScanNet v2, SUN RGB-D, and S3DIS. Moreover, to take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features. We employ our fusion module to make conventional 3D object detection methods multimodal and demonstrate a notable boost in performance. Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset. Overall, besides being accurate, both TR3D and TR3D+FF are lightweight, memory-efficient, and fast, marking another milestone on the way toward real-time 3D object detection. Code is available at https://github.com/SamsungLabs/tr3d.
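To make the idea of fusing 2D and 3D features early more concrete, the following is a minimal, generic sketch and not the TR3D+FF implementation. The abstract does not specify the fusion mechanism, so everything here is an assumption made for illustration: points are given in the camera frame, the camera is a simple pinhole model with intrinsics `fx`, `fy`, `cx`, `cy`, each point picks up the 2D feature at its nearest projected pixel, and the two feature vectors are combined by element-wise summation. The function names `project_point` and `fuse_early` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec = List[float]


def project_point(
    point: Sequence[float], fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-frame point (x, y, z) to pixel (u, v).

    Returns None for points at or behind the camera plane (z <= 0).
    """
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def fuse_early(
    points: Sequence[Sequence[float]],
    point_feats: Sequence[Vec],
    image_feats: Sequence[Sequence[Vec]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Vec]:
    """Add the 2D feature at each point's projected pixel to its 3D feature.

    image_feats is indexed as [row][col] and holds feature vectors with the
    same length as the point features (an assumption of this sketch).
    Points that fall outside the image or behind the camera keep their
    original 3D feature unchanged.
    """
    height, width = len(image_feats), len(image_feats[0])
    fused: List[Vec] = []
    for point, feat in zip(points, point_feats):
        out = list(feat)
        uv = project_point(point, fx, fy, cx, cy)
        if uv is not None:
            # Nearest-pixel lookup; floor(x + 0.5) rounds half up.
            col = math.floor(uv[0] + 0.5)
            row = math.floor(uv[1] + 0.5)
            if 0 <= row < height and 0 <= col < width:
                out = [a + b for a, b in zip(feat, image_feats[row][col])]
        fused.append(out)
    return fused
```

In a pipeline of this kind, the fused per-point features would then be passed to the 3D backbone in place of the raw point features, which is what makes the fusion "early". Summation is only one possible choice; concatenation followed by a learned projection is a common alternative.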