Robotic perception requires modeling both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, neglecting finer geometric details and struggling to handle general, out-of-vocabulary objects. To overcome these limitations, we introduce a novel task, 3D occupancy prediction, which aims to estimate the detailed occupancy and semantics of objects from multi-view images. To facilitate this task, we develop a label generation pipeline that produces dense, visibility-aware labels for a given scene. This pipeline includes point cloud aggregation, point labeling, and occlusion handling. We construct two benchmarks, Occ3D-Waymo and Occ3D-nuScenes, based on the Waymo Open Dataset and the nuScenes dataset, respectively. Lastly, we propose a model, dubbed the Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the 3D occupancy prediction task. The model addresses the need for finer geometric understanding in a coarse-to-fine fashion. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.
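To make the point-labeling step of the label generation pipeline more concrete, the sketch below shows one plausible way to turn an already aggregated, semantically labeled point cloud into a voxel grid by majority vote. It is an illustration under stated assumptions, not the released Occ3D pipeline: the function name `voxelize_labels`, the `FREE` sentinel, and all parameters are hypothetical, and multi-frame aggregation and occlusion handling are left out.

```python
import numpy as np

FREE = -1  # hypothetical sentinel for voxels that contain no points


def voxelize_labels(points, labels, origin, voxel_size, grid_shape, num_classes):
    """Assign each voxel the most frequent semantic label among its points.

    points:      (N, 3) array of xyz coordinates (assumed already aggregated)
    labels:      (N,) integer class ids in [0, num_classes)
    origin:      xyz of the grid's minimum corner
    voxel_size:  edge length of a cubic voxel
    grid_shape:  (X, Y, Z) number of voxels along each axis
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=np.int64)

    # Integer voxel coordinates of every point.
    idx = np.floor((points - np.asarray(origin, dtype=float)) / voxel_size)
    idx = idx.astype(np.int64)

    # Discard points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Count label votes per voxel; np.add.at accumulates repeated indices.
    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    votes = np.zeros((int(np.prod(grid_shape)), num_classes), dtype=np.int64)
    np.add.at(votes, (flat, labels), 1)

    # Majority vote, then mark voxels without any votes as FREE.
    grid = votes.argmax(axis=1)
    grid[votes.sum(axis=1) == 0] = FREE
    return grid.reshape(grid_shape)
```

A visibility-aware version would additionally need to distinguish voxels that were observed to be empty from voxels that were never observed, for example by tracing sensor rays; the sketch treats both cases as `FREE`.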