Understanding how the 3D scene evolves is vital for decision-making in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework for learning a world model, OccWorld, in the 3D occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness: 3D occupancy can describe a more fine-grained 3D structure of the scene; 2) efficiency: 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points); 3) versatility: 3D occupancy can adapt to both vision and LiDAR. To facilitate modeling of the world evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens that describe the surrounding scenes. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens, which are decoded into the future occupancy and ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.
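The two-stage pipeline described in the abstract (discretize the scene into tokens, then generate future tokens autoregressively) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the OccWorld code: the tokenizer is modeled as a nearest-neighbor codebook lookup, the function names `tokenize`, `detokenize`, and `rollout` are hypothetical, and `predict_next` is a stand-in for the generative transformer. The ego token is omitted for brevity.

```python
import numpy as np

def tokenize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) array of scene features (hypothetical encoder output).
    codebook: (K, D) array of learned code vectors (assumed given).
    """
    # Squared Euclidean distance between every feature and every code: (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def detokenize(tokens, codebook):
    """Look up the code vector for each discrete token."""
    return codebook[tokens]

def rollout(history, predict_next, steps):
    """Autoregressively generate `steps` future frames of tokens.

    history: list of per-frame token arrays observed so far.
    predict_next: callable taking the list of frames and returning the next
        frame's tokens (a placeholder for the generative transformer).
    """
    frames = list(history)
    for _ in range(steps):
        # Each predicted frame is fed back in as context for the next step.
        frames.append(predict_next(frames))
    return frames[len(history):]
```

As a usage illustration, passing `predict_next=lambda frames: frames[-1].copy()` yields a trivial "static scene" baseline that repeats the last observed frame; a learned model would replace that callable.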