In monocular 3D detection (M3D), it is common practice to use scene geometric cues to improve detector performance. However, many existing works use these cues explicitly, for example by estimating a depth map and back-projecting it into 3D space. This explicit approach produces sparse 3D representations, because lifting from 2D to 3D increases dimensionality, and it causes substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that infers dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), which facilitates the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and it demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments on the KITTI-3D benchmark and the Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Code is available at https://github.com/cskkxjk/MonoNeRD.
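The pipeline described in the abstract (SDF, then a radiance-field view of the scene, then volume rendering to RGB and depth) can be illustrated for a single camera ray. The following is a minimal sketch under stated assumptions, not the MonoNeRD implementation: the Laplace-CDF mapping from signed distance to density, the `beta` parameter, and the function names `sdf_to_density` and `render_ray` are all hypothetical choices made for illustration. The rendering step itself is the standard NeRF-style quadrature.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.1):
    """Map signed distance to volume density.

    Assumption: a Laplace-CDF transform. Density approaches 1/beta inside
    a surface (sdf < 0) and decays toward 0 in free space (sdf > 0).
    """
    s = -np.asarray(sdf, dtype=float)
    e = 0.5 * np.exp(-np.abs(s) / beta)      # never overflows
    return (1.0 / beta) * np.where(s <= 0, e, 1.0 - e)

def render_ray(density, rgb, depths):
    """Standard volume-rendering quadrature along one ray.

    density: (N,) densities at the samples
    rgb:     (N, 3) colors at the samples
    depths:  (N,) increasing sample depths along the ray
    """
    # Segment lengths; the last segment reuses the previous spacing.
    last = depths[-1] + (depths[-1] - depths[-2])
    deltas = np.diff(depths, append=last)
    # Opacity of each segment.
    alpha = 1.0 - np.exp(-density * deltas)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)   # rendered pixel color
    depth = (weights * depths).sum()               # rendered (expected) depth
    return color, depth, weights

# Toy ray: samples every 0.1 m from 1 m to 20 m, with a surface at 10 m.
depths = np.linspace(1.0, 20.0, 191)
sdf = 10.0 - depths                      # positive in front, negative behind
rgb = np.tile([0.2, 0.4, 0.6], (depths.size, 1))

color, depth, weights = render_ray(sdf_to_density(sdf), rgb, depths)
print("rendered color:", color)
print("rendered depth:", depth)          # expected to land close to the 10 m surface
```

Because every sample along the ray receives a density, the representation is dense in 3D, in contrast to back-projecting one depth value per pixel. The rendered color and depth are differentiable functions of the per-sample predictions, which is what allows RGB images and depth maps to serve as supervision signals.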