We introduce ScenarioControl, the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, ScenarioControl synthesizes diverse, realistic 3D scenario rollouts, including the map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that jointly represents road structure and dynamic agents. To connect multimodal control with sparse vectorized scene elements, we propose a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The method produces temporally consistent scenario rollouts from the perspectives of different actors in the scene, supporting long-horizon continuation of driving scenarios. To facilitate training and evaluation, we release a dataset with text annotations aligned to vectorized map structures. Extensive experiments validate that the control adherence and fidelity of ScenarioControl compare favorably to all tested methods across all experiments. Project webpage: https://light.princeton.edu/ScenarioControl
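The abstract does not spell out the internals of the cross-global control mechanism. Purely as an illustration of the general idea it names (cross-attention combined with a lightweight global-context branch), a minimal NumPy sketch might look like the following. Everything here is an assumption made for illustration and not the authors' implementation: the function names `cross_global_control` and `init_params`, the token shapes, the mean pooling of the condition tokens, and the additive fusion of the two branches.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, seed=0):
    # Hypothetical projection matrices; a trained model would learn these.
    rng = np.random.default_rng(seed)
    names = ("Wq", "Wk", "Wv", "Wg")
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}


def cross_global_control(scene_tokens, cond_tokens, params):
    """Fuse condition embeddings into vectorized scene tokens.

    scene_tokens: (N, d) latents for sparse scene elements (assumed layout).
    cond_tokens:  (M, d) text or image embeddings (assumed layout).
    Returns the updated (N, d) tokens and the (N, M) attention weights.
    """
    d = scene_tokens.shape[-1]

    # Cross-attention branch: every scene element attends to the condition.
    q = scene_tokens @ params["Wq"]
    k = cond_tokens @ params["Wk"]
    v = cond_tokens @ params["Wv"]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    local = attn @ v

    # Global-context branch (assumed form): pool the condition into one
    # vector, project it, and broadcast it to all scene elements.
    global_ctx = cond_tokens.mean(axis=0) @ params["Wg"]

    # Additive residual fusion of both branches (an assumption).
    return scene_tokens + local + global_ctx, attn


rng = np.random.default_rng(1)
scene = rng.normal(size=(6, 8))   # 6 scene elements, 8-dim latents
cond = rng.normal(size=(4, 8))    # 4 condition tokens
out, attn = cross_global_control(scene, cond, init_params(8))
```

In this sketch the attention branch gives each scene element its own view of the prompt, while the pooled branch injects the same scene-level signal into every element; how the actual method balances or gates the two is not stated in the abstract.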