Recently, sparse 3D convolutions have changed 3D object detection. Performing on par with voting-based approaches, 3D CNNs are memory-efficient and scale better to large scenes. However, there is still room for improvement. Taking a deliberate, practice-oriented approach to problem-solving, we analyze the performance of such methods and localize their weaknesses. Applying modifications that resolve the identified issues one by one, we arrive at TR3D: a fast, fully convolutional 3D object detection model trained end-to-end that achieves state-of-the-art results on the standard benchmarks ScanNet v2, SUN RGB-D, and S3DIS. Moreover, to take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features. We employ our fusion module to make conventional 3D object detection methods multimodal and demonstrate an impressive boost in performance. Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset. Overall, besides being accurate, both TR3D and TR3D+FF are lightweight, memory-efficient, and fast, marking another milestone on the way toward real-time 3D object detection. Code is available at https://github.com/SamsungLabs/tr3d.