Autonomous driving has seen remarkable advances, driven largely by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising alternative by synthesizing realistic sensor data, yet existing approaches focus primarily on single-modality generation, leading to inefficiency and misalignment across multimodal sensor data. To address these challenges, we propose OminiGen, a unified framework that generates aligned multimodal sensor data. Our approach unifies multimodal features in a shared Bird's Eye View (BEV) space and introduces UAE, a novel generalizable multimodal reconstruction method that jointly decodes LiDAR and multi-view camera data. UAE performs multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Comprehensive experiments demonstrate that OminiGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustment.
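As a rough illustration of the controllable generation step described above, the following minimal PyTorch sketch shows a DiT-style denoiser operating on a shared BEV latent with a zero-initialized ControlNet-style conditioning branch. All module names, tensor shapes, and hyperparameters (BEV_C, BEV_H, BEV_W, TinyDiTBlock, ControlBranch, UnifiedBEVGenerator) are illustrative assumptions rather than the paper's actual architecture, and the UAE volume-rendering decoder is omitted entirely.

```python
import torch
import torch.nn as nn

# Hypothetical shared-BEV latent size; the abstract does not specify dimensions.
BEV_C, BEV_H, BEV_W = 64, 32, 32


class TinyDiTBlock(nn.Module):
    """A single transformer block over flattened BEV tokens (stand-in for a DiT block)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class ControlBranch(nn.Module):
    """Zero-initialized projection of a control map (e.g., a BEV layout) into the latent,
    in the spirit of ControlNet: the branch contributes nothing at initialization."""

    def __init__(self, in_ch: int, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, control):
        return self.proj(control)


class UnifiedBEVGenerator(nn.Module):
    """Toy denoiser: predicts noise on a shared BEV latent, conditioned on a control map."""

    def __init__(self, dim: int = BEV_C, control_ch: int = 3, depth: int = 2):
        super().__init__()
        self.control = ControlBranch(control_ch, dim)
        self.blocks = nn.ModuleList([TinyDiTBlock(dim) for _ in range(depth)])
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_bev, control_map):
        # Inject control features, then run transformer blocks over flattened BEV tokens.
        x = noisy_bev + self.control(control_map)              # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        for blk in self.blocks:
            tokens = blk(tokens)
        return self.out(tokens).transpose(1, 2).reshape_as(noisy_bev)


if __name__ == "__main__":
    model = UnifiedBEVGenerator()
    noisy = torch.randn(1, BEV_C, BEV_H, BEV_W)   # noisy shared BEV latent
    layout = torch.randn(1, 3, BEV_H, BEV_W)      # hypothetical BEV layout condition
    print(model(noisy, layout).shape)             # torch.Size([1, 64, 32, 32])
```

In an actual pipeline, the denoised BEV latent would then be passed to a multimodal decoder (the role UAE plays in the paper) to render both LiDAR and multi-view camera outputs from the same latent, which is what keeps the generated modalities aligned.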