Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods capture only a limited range of modalities, restricting their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene-sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we further introduce the Action-aware Map Alignment (AMA) module, which transforms the map according to the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/
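The two-stage decoding described above can be sketched as nested loops: an outer temporal loop over frames (TAR) and an inner loop over modalities in a fixed order (OAR). The sketch below is purely illustrative; all names (`tar_step`, `oar_step`, `MODALITY_ORDER`) and the placeholder predictors are assumptions, not the paper's actual implementation or released code.

```python
# Hypothetical sketch of the TAR/OAR two-stage autoregressive scheme.
# The four modalities and their intra-frame order are assumed for illustration.
MODALITY_ORDER = ["ego_action", "map", "agents", "image"]

def tar_step(history, modality):
    """Stand-in for the Temporal AutoRegressive (TAR) stage: gathers the
    token history of one modality across previous frames as context."""
    return [tok for frame in history for tok in frame[modality]]

def oar_step(temporal_ctx, partial_frame, modality):
    """Stand-in for the Ordered AutoRegressive (OAR) stage: predicts the
    current frame's tokens for one modality, conditioned on the temporal
    context and on modalities already generated within this frame.
    A real model would sample from a transformer; here we emit placeholders."""
    return [f"{modality}_tok_{len(temporal_ctx)}_{len(partial_frame)}"]

def generate(num_frames):
    frames = []
    for _ in range(num_frames):
        frame = {}
        for modality in MODALITY_ORDER:          # fixed intra-frame order (OAR)
            ctx = tar_step(frames, modality)     # inter-frame dynamics (TAR)
            frame[modality] = oar_step(ctx, frame, modality)
        frames.append(frame)
    return frames

scenes = generate(3)
```

Splitting prediction this way keeps each attention pass small: the temporal stage attends only within one modality across frames, and the ordered stage attends only within one frame, rather than over the full cross-product of frames and modalities.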