Comprehensive modeling of the surrounding 3D world is key to the success of autonomous driving. However, existing perception tasks such as object detection, road structure segmentation, depth and elevation estimation, and open-set object localization each address only a small facet of holistic 3D scene understanding. This divide-and-conquer strategy simplifies algorithm development, but at the cost of an end-to-end unified solution to the problem. In this work, we address this limitation by studying camera-based 3D panoptic segmentation, aiming to achieve a unified occupancy representation for camera-only 3D scene understanding. To this end, we introduce a novel method called PanoOcc, which uses voxel queries to aggregate spatiotemporal information from multi-frame and multi-view images in a coarse-to-fine scheme, integrating feature learning and scene representation into a unified occupancy representation. We conduct extensive ablation studies to verify the effectiveness and efficiency of the proposed method. Our approach achieves new state-of-the-art results for camera-based semantic segmentation and panoptic segmentation on the nuScenes dataset. Furthermore, our method can be easily extended to dense occupancy prediction and shows promising performance on the Occ3D benchmark. The code will be released at https://github.com/Robertwyq/PanoOcc.
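To make the idea of voxel queries gathering multi-view image evidence concrete, the following is a minimal sketch and not the PanoOcc implementation. It assumes a regular voxel grid in ego coordinates, one 3x4 ego-to-pixel projection matrix per camera, and a simple stand-in aggregation rule: each voxel center is projected into every camera, the feature at the nearest pixel is read, and the features from all cameras that see the voxel are averaged. The function names `voxel_centers` and `aggregate_multiview`, the array layouts, and the averaging rule are all illustrative assumptions; temporal fusion and the coarse-to-fine refinement mentioned in the abstract are omitted.

```python
import numpy as np


def voxel_centers(grid_shape, pc_range):
    """Return (N, 3) centers of a regular voxel grid in ego coordinates.

    grid_shape: (nx, ny, nz); pc_range: (x0, y0, z0, x1, y1, z1).
    Both are hypothetical parameters chosen for this sketch.
    """
    nx, ny, nz = grid_shape
    x0, y0, z0, x1, y1, z1 = pc_range
    xs = x0 + (np.arange(nx) + 0.5) * (x1 - x0) / nx
    ys = y0 + (np.arange(ny) + 0.5) * (y1 - y0) / ny
    zs = z0 + (np.arange(nz) + 0.5) * (z1 - z0) / nz
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)


def aggregate_multiview(centers, feats, projs, eps=1e-5):
    """Average image features over the cameras that see each voxel center.

    centers: (N, 3) voxel centers.
    feats:   (V, H, W, C) per-camera feature maps.
    projs:   (V, 3, 4) ego-to-pixel projection matrices (assumed given).
    Returns (N, C) aggregated features and (N,) per-voxel camera hit counts.
    """
    num_views, height, width, channels = feats.shape
    n = centers.shape[0]
    homo = np.concatenate([centers, np.ones((n, 1))], axis=1)  # (N, 4)
    out = np.zeros((n, channels))
    hits = np.zeros(n, dtype=np.int64)
    for v in range(num_views):
        cam = homo @ projs[v].T                      # (N, 3): (u*d, v*d, d)
        depth = cam[:, 2]
        in_front = depth > eps                       # ignore points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
        col = np.floor(cam[:, 0] / safe_depth).astype(np.int64)
        row = np.floor(cam[:, 1] / safe_depth).astype(np.int64)
        valid = in_front & (col >= 0) & (col < width) & (row >= 0) & (row < height)
        out[valid] += feats[v, row[valid], col[valid]]
        hits[valid] += 1
    out /= np.maximum(hits, 1)[:, None]              # voxels seen by no camera stay zero
    return out, hits
```

Voxels that no camera observes keep a zero feature here; a learned query embedding or an attention-based sampler would be natural replacements for the nearest-pixel average used in this sketch.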