Robotic perception requires modeling both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, which neglects finer geometric details and makes it hard to handle general, out-of-vocabulary objects. 3D occupancy prediction, which estimates the detailed occupancy states and semantics of a scene, is an emerging task that addresses these limitations. To support 3D occupancy prediction, we develop a label generation pipeline that produces dense, visibility-aware labels for any given scene. The pipeline comprises three stages: voxel densification, occlusion reasoning, and image-guided voxel refinement. We establish two benchmarks, Occ3D-Waymo and Occ3D-nuScenes, derived from the Waymo Open Dataset and the nuScenes dataset, respectively. Furthermore, we provide an extensive analysis of the proposed datasets with various baseline models. Lastly, we propose a new model, the Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the Occ3D benchmarks. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.
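To make the idea of visibility-aware occupancy labels concrete, the following is a minimal sketch, not the pipeline described above. It assumes that visibility is decided by casting a ray from the sensor origin to each LiDAR return: voxels the ray passes through are marked free, the voxel containing the return is marked occupied, and every voxel no ray touches stays unobserved. The names `label_voxels`, `state_of`, `voxel_size`, and `step_fraction` are hypothetical, and the fixed-step ray marching is a simplification that can skip voxels a ray only clips at a corner.

```python
import math

# Three voxel states; the integer values are arbitrary (assumption).
FREE, OCCUPIED, UNOBSERVED = 0, 1, 2


def voxel_index(point, voxel_size):
    """Map a 3D point to the integer index of the voxel containing it."""
    return tuple(math.floor(c / voxel_size) for c in point)


def label_voxels(origin, points, voxel_size=1.0, step_fraction=0.25):
    """Return {voxel_index: state} for every voxel touched by a sensor ray.

    Voxels missing from the result were never observed.
    """
    labels = {}
    step = voxel_size * step_fraction
    for p in points:
        direction = [b - a for a, b in zip(origin, p)]
        length = math.sqrt(sum(d * d for d in direction))
        end_voxel = voxel_index(p, voxel_size)
        n_steps = int(length / step)
        for i in range(n_steps + 1):
            # Fraction of the way from the sensor to the return.
            t = (i * step) / length if length > 0 else 0.0
            sample = [a + t * d for a, d in zip(origin, direction)]
            v = voxel_index(sample, voxel_size)
            if v != end_voxel:
                # setdefault never downgrades a voxel already marked OCCUPIED.
                labels.setdefault(v, FREE)
        # The voxel holding the return is occupied, even if another ray
        # previously passed through it.
        labels[end_voxel] = OCCUPIED
    return labels


def state_of(labels, voxel):
    """Look up a voxel's state, defaulting to UNOBSERVED."""
    return labels.get(voxel, UNOBSERVED)


if __name__ == "__main__":
    sensor = (0.5, 0.5, 0.5)
    returns = [(3.5, 0.5, 0.5)]
    grid = label_voxels(sensor, returns)
    for voxel in [(1, 0, 0), (3, 0, 0), (5, 0, 0)]:
        print(voxel, state_of(grid, voxel))
```

In this toy scene a single return along the x-axis leaves the voxels between the sensor and the hit free, marks the hit voxel occupied, and leaves everything behind it unobserved, which is the distinction an occlusion-aware label needs to carry.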