Robotic perception requires modeling both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, which neglects finer geometric details and struggles with general, out-of-vocabulary objects. 3D occupancy prediction, which estimates the detailed occupancy state and semantics of a scene, is an emerging task that addresses these limitations. To support 3D occupancy prediction, we develop a label generation pipeline that produces dense, visibility-aware labels for any given scene. The pipeline comprises three stages: voxel densification, occlusion reasoning, and image-guided voxel refinement. We establish two benchmarks, Occ3D-Waymo and Occ3D-nuScenes, derived from the Waymo Open Dataset and the nuScenes dataset, respectively. Furthermore, we provide an extensive analysis of the proposed datasets using various baseline models. Lastly, we propose a new model, the Coarse-to-Fine Occupancy (CTF-Occ) network, which achieves superior performance on the Occ3D benchmarks. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.
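To make "visibility-aware labels" concrete, the following is a minimal sketch of how a single LiDAR sweep could be turned into occupied, free, and unobserved voxel labels. It is not the Occ3D pipeline: the abstract gives no implementation details, so the function name `label_voxels`, the label values, and the fixed-step ray marching used here are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical label values, chosen for this sketch only.
FREE, OCCUPIED, UNOBSERVED = 0, 1, 255


def label_voxels(points, origin, grid_min, voxel_size, grid_shape, step=0.5):
    """Label voxels as occupied, free, or unobserved from one LiDAR sweep.

    Assumption: simple fixed-step ray marching stands in for whatever
    occlusion reasoning the actual pipeline performs.
    """
    points = np.asarray(points, dtype=float).reshape(-1, 3)
    origin = np.asarray(origin, dtype=float)
    grid_min = np.asarray(grid_min, dtype=float)
    shape = np.asarray(grid_shape)

    # Every voxel starts as unobserved until a ray provides evidence.
    labels = np.full(grid_shape, UNOBSERVED, dtype=np.uint8)

    def to_index(xyz):
        # Convert metric coordinates to voxel indices, dropping out-of-grid ones.
        idx = np.floor((xyz - grid_min) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < shape), axis=-1)
        return idx[inside]

    # March along each sensor-to-point ray; voxels the ray crosses are free.
    for p in points:
        length = np.linalg.norm(p - origin)
        n = max(int(length / (step * voxel_size)), 1)
        t = np.linspace(0.0, 1.0, n, endpoint=False)[:, None]
        i = to_index(origin + t * (p - origin))
        labels[i[:, 0], i[:, 1], i[:, 2]] = FREE

    # Mark hits last so an occupied voxel is never overwritten by a passing ray.
    hits = to_index(points)
    labels[hits[:, 0], hits[:, 1], hits[:, 2]] = OCCUPIED
    return labels


labels = label_voxels(
    points=[[5.5, 0.5, 0.5]],
    origin=[0.5, 0.5, 0.5],
    grid_min=[0.0, 0.0, 0.0],
    voxel_size=1.0,
    grid_shape=(10, 1, 1),
)
```

In this toy example the voxels between the sensor and the return are labeled free, the voxel containing the return is occupied, and everything behind it stays unobserved, which is the distinction a visibility-aware label needs to preserve.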