In monocular 3D detection (M3D), it is common practice to use scene geometric clues to improve detector performance. However, many existing works use these clues explicitly, for example by estimating a depth map and back-projecting it into 3D space. This explicit approach yields sparse 3D representations, because lifting from 2D to 3D increases dimensionality, and it causes substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), which facilitates the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and it demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments on the KITTI-3D benchmark and the Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Code is available at https://github.com/cskkxjk/MonoNeRD.
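To make the SDF-to-rendering step concrete, the following is a minimal NumPy sketch of volume rendering along a single camera ray. It is an illustration under stated assumptions, not the MonoNeRD implementation: the function names `sdf_to_density` and `render_ray`, the logistic SDF-to-density mapping, and the parameter `beta` are all hypothetical choices made here, and the actual formulation in the released code may differ. The compositing itself follows the standard NeRF quadrature (per-sample opacity times accumulated transmittance).

```python
import numpy as np

def sdf_to_density(sdf, beta=0.05):
    """Map signed distances to volume density.

    ASSUMPTION: a logistic mapping chosen for illustration only.
    Density is near 0 in free space (sdf > 0) and near 1/beta
    inside the surface (sdf < 0).
    """
    return (1.0 / beta) / (1.0 + np.exp(sdf / beta))

def render_ray(sdf, rgb, t_vals, beta=0.05):
    """Composite colour and depth along one ray.

    sdf:    (N,)   signed distance at each sample
    rgb:    (N, 3) colour at each sample
    t_vals: (N,)   increasing sample depths along the ray
    """
    sigma = sdf_to_density(sdf, beta)
    # Interval lengths between samples; the last interval reuses
    # the previous length so that every sample has a width.
    deltas = np.diff(t_vals)
    deltas = np.append(deltas, deltas[-1])
    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i unabsorbed.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = alpha * trans
    color = (weights[:, None] * rgb).sum(axis=0)
    depth = (weights * t_vals).sum()
    return color, depth, weights

# Toy ray: a flat red surface 5 units in front of the camera.
t_vals = np.linspace(0.0, 10.0, 201)
sdf = 5.0 - t_vals                      # positive in front, negative behind
rgb = np.tile([1.0, 0.0, 0.0], (t_vals.size, 1))
color, depth, weights = render_ray(sdf, rgb, t_vals)
print(color, depth)
```

On this toy ray the rendering weights concentrate around the zero crossing of the SDF, so the rendered depth lands close to the surface distance of 5 and the colour is close to pure red. Because every sample along the ray receives a density, the representation is dense in 3D, in contrast to back-projecting a single depth value per pixel.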