OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development through the motion of individual instances, world models have emerged as a generative framework for describing general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolution. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations of the 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then train a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for decision-making in autonomous driving. Code is available at: https://github.com/wzzheng/OccSora.
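To make the phrase "compact discrete spatial-temporal representations" concrete, the following is a minimal sketch of one common way to discretize continuous features: nearest-neighbor lookup in a codebook (vector quantization). The abstract does not say how the 4D scene tokenizer produces its discrete tokens, so the use of a codebook here is an assumption, and the names `quantize`, `reconstruct`, `codebook`, and `features` and all numeric values are hypothetical and are not taken from the OccSora code.

```python
from math import dist

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: the tokenizer discretizes features by nearest-neighbor
    lookup, as in vector quantization. The abstract does not confirm this.
    """
    tokens = []
    for vec in features:
        # Index of the codebook entry with the smallest Euclidean distance.
        nearest = min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))
        tokens.append(nearest)
    return tokens

def reconstruct(tokens, codebook):
    """Replace each discrete token with its codebook vector."""
    return [codebook[t] for t in tokens]

# Hypothetical toy data: three 2-D codebook entries and three feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

tokens = quantize(features, codebook)
recon = reconstruct(tokens, codebook)
print(tokens)
print(recon)
```

In a pipeline of the kind the abstract describes, a generative model would operate on such a compact token grid rather than on raw occupancy voxels, and a decoder would map generated tokens back to occupancy. This toy example covers only the discretization step.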

