Generative Scenario Rollouts for End-to-End Autonomous Driving

Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai

Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
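The autoregressive rollout and rollout-consistency loss described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the form of the loss or the model, so the mean-squared-error loss, the plain-list latent vectors, and the names `rollout`, `rollout_consistency_loss`, and `toy_step` are all assumptions made for illustration. The sketch only shows the general pattern of feeding each predicted latent back in as the next input and penalizing deviation from targets (ground truth or pseudo-labels) over the whole horizon.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def rollout(step: Callable[[Vector], Vector], z0: Vector, horizon: int) -> List[Vector]:
    """Autoregressively apply `step`, feeding each prediction back as the next input."""
    latents: List[Vector] = []
    z = z0
    for _ in range(horizon):
        z = step(z)
        latents.append(z)
    return latents


def rollout_consistency_loss(predicted: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Mean squared error between rolled-out latents and targets.

    The MSE form is an assumption; the abstract only says the loss uses
    ground truth or pseudo-labels to stabilize predictions.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must cover the same horizon")
    total, count = 0.0, 0
    for p, t in zip(predicted, targets):
        for p_i, t_i in zip(p, t):
            total += (p_i - t_i) ** 2
            count += 1
    if count == 0:
        raise ValueError("cannot compute a loss over an empty rollout")
    return total / count


def toy_step(z: Vector) -> Vector:
    """Hypothetical stand-in for the learned model: halve every latent dimension."""
    return [0.5 * x for x in z]


# Roll out three steps from an initial latent, then compare with hand-made targets
# that stand in for ground truth or pseudo-labels.
predicted = rollout(toy_step, [4.0, 8.0], horizon=3)
targets = [[2.0, 4.0], [1.0, 2.0], [0.5, 2.0]]
loss = rollout_consistency_loss(predicted, targets)
```

In this toy setup only the final step deviates from its target, so the loss isolates the drift that accumulates late in the rollout. A real system would replace `toy_step` with the VLA model's latent-token prediction and would likely combine this term with the planning, motion, and language objectives mentioned above.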
