3D occupancy prediction, which quantizes a 3D scene into grid cells with semantic labels, holds significant promise for robot perception and autonomous driving. Recent work relies mainly on complete occupancy labels in 3D voxel space for supervision. However, the expensive annotation process and sometimes ambiguous labels have severely constrained the usability and scalability of 3D occupancy models. To address this, we present RenderOcc, a novel paradigm for training 3D occupancy models using only 2D labels. Specifically, we extract a NeRF-style 3D volume representation from multi-view images and employ volume rendering to produce 2D renderings, so that 2D semantic and depth labels supervise the 3D volume directly. Additionally, we introduce an Auxiliary Ray method to tackle the issue of sparse viewpoints in autonomous driving scenarios; it leverages sequential frames to build comprehensive 2D renderings of each object. To the best of our knowledge, RenderOcc is the first attempt to train multi-view 3D occupancy models using only 2D labels, reducing the dependence on costly 3D occupancy annotations. Extensive experiments demonstrate that RenderOcc achieves performance comparable to models fully supervised with 3D labels, underscoring the significance of this approach for real-world applications.
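To make the rendering-based supervision described in the abstract concrete, the following is a minimal sketch of standard NeRF-style volume rendering along a single camera ray, written in plain Python. It is not taken from the RenderOcc implementation: the function names `render_ray` and `ray_loss`, the input layout, and the particular loss (cross-entropy on the rendered semantics plus an absolute depth error) are assumptions chosen for illustration only.

```python
import math


def render_ray(densities, semantic_probs, depths):
    """Composite samples along one ray (hypothetical helper, for illustration).

    densities:      non-negative density sigma_i at each sample
    semantic_probs: per-sample list of class probabilities
    depths:         increasing sample distances t_i along the ray
    Returns (rendered_depth, rendered_semantics, weights).
    """
    n = len(densities)
    num_classes = len(semantic_probs[0])
    transmittance = 1.0  # probability that the ray reaches the current sample
    weights = []
    rendered_depth = 0.0
    rendered_sem = [0.0] * num_classes
    for i in range(n):
        # Interval length; the last sample reuses the previous spacing.
        if i + 1 < n:
            delta = depths[i + 1] - depths[i]
        elif n > 1:
            delta = depths[i] - depths[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-densities[i] * delta)  # opacity of this interval
        w = transmittance * alpha                      # contribution of sample i
        weights.append(w)
        rendered_depth += w * depths[i]
        for c in range(num_classes):
            rendered_sem[c] += w * semantic_probs[i][c]
        transmittance *= 1.0 - alpha
    return rendered_depth, rendered_sem, weights


def ray_loss(rendered_depth, rendered_sem, gt_depth, gt_class, eps=1e-8):
    """Assumed 2D loss: cross-entropy on semantics plus absolute depth error."""
    semantic_term = -math.log(rendered_sem[gt_class] + eps)
    depth_term = abs(rendered_depth - gt_depth)
    return semantic_term + depth_term


# Example: three samples along a ray, with an opaque surface at the second one.
depth, sem, weights = render_ray(
    densities=[0.0, 50.0, 0.0],
    semantic_probs=[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    depths=[1.0, 2.0, 3.0],
)
loss = ray_loss(depth, sem, gt_depth=2.0, gt_class=1)
```

Because the weights depend on the predicted densities, a loss computed on the rendered 2D depth and semantics constrains the 3D volume along the whole ray. In a real training setup these operations would be expressed in an automatic-differentiation framework so that gradients flow back to the volume.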