World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving that directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream tasks.
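The abstract does not specify the mechanism behind Unified Latent Anchoring; the sketch below illustrates only the general idea of matching latent statistics across modalities before fusion, under the assumption (ours, not the paper's) that alignment amounts to standardizing each modality's latents before concatenating them for the diffusion transformer. Function and variable names are hypothetical.

```python
import numpy as np

def unified_latent_anchoring(video_z, lidar_z, eps=1e-6):
    """Illustrative sketch (not the paper's actual ULA): align two
    modality latents by standardizing each to zero mean and unit
    variance, then fuse along the channel axis."""
    def standardize(z):
        return (z - z.mean()) / (z.std() + eps)
    v = standardize(video_z)   # video VAE latents, now zero-mean
    l = standardize(lidar_z)   # LiDAR VAE latents, now zero-mean
    # Fused representation would feed the diffusion transformer.
    return np.concatenate([v, l], axis=-1)

# Toy latents with deliberately mismatched scales and offsets.
video_z = np.random.randn(4, 16) * 3.0 + 5.0
lidar_z = np.random.randn(4, 16) * 0.2 - 1.0
fused = unified_latent_anchoring(video_z, lidar_z)
print(fused.shape)  # (4, 32)
```

After standardization, both halves of the fused tensor share comparable first- and second-order statistics, which is one plausible way to obtain the cross-modal compatibility and training stability the abstract attributes to ULA.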