Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development through the motion of individual instances, world models have emerged as a generative framework for describing general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolution. To address this, we propose OccSora, a diffusion-based 4D occupancy generation model that simulates the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations of the 4D occupancy input and achieve high-quality reconstruction of long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for decision-making in autonomous driving. Code is available at: https://github.com/wzzheng/OccSora.
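To make the notion of a "discrete spatial-temporal representation" concrete, the following is a minimal sketch of a generic nearest-codebook lookup, the basic step by which continuous latent vectors can be turned into discrete token indices. It is an illustration under assumptions and not OccSora's tokenizer: the abstract does not specify the quantization scheme, and the function names `nearest_code` and `tokenize`, the two-dimensional latents, and the three-entry codebook are all hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def tokenize(latents, codebook):
    """Replace every latent vector in a [time][position] grid with a token index."""
    return [[nearest_code(v, codebook) for v in frame] for frame in latents]


# Hypothetical toy data: 2 time steps, 2 spatial positions, 2-dim latents.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [1.1, -0.3]],
]
tokens = tokenize(latents, codebook)
print(tokens)  # [[0, 1], [2, 1]]
```

In a setup of this kind, a generative model would operate on the compact token grid rather than on the raw occupancy volume, which is what makes long sequences tractable.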