Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that accommodate the agents' configurations. To address these challenges, we propose TTSG, a modular framework comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm. Although large language models (LLMs) serve as general-purpose planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench show that our method achieves the lowest average collision rate (3.5\%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results highlight the potential of the proposed framework for flexible, interpretable, and safety-oriented simulation.
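The abstract names a plan-aware road ranking step that keeps agent actions consistent with road geometry, but does not say how roads are scored. The sketch below is a hypothetical illustration of the general idea, not the TTSG algorithm. It assumes each candidate road exposes a lane count, a junction flag, and a set of supported maneuvers. It also assumes each agent plan carries one action label and a minimum lane requirement. Roads are scored by the fraction of agent plans they can support, and the ranking keeps the best matches first. All class names, fields, and the scoring rule are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Road:
    # Hypothetical road descriptor; the real framework's road features are not specified.
    road_id: str
    num_lanes: int
    has_junction: bool
    maneuvers: set[str] = field(default_factory=set)  # e.g. {"straight", "left_turn"}


@dataclass
class AgentPlan:
    # Hypothetical per-agent plan: one action label plus a geometric requirement.
    agent_id: str
    action: str
    min_lanes: int = 1


def road_score(road: Road, plans: list[AgentPlan]) -> float:
    """Fraction of agent plans whose action and lane requirement the road supports."""
    if not plans:
        return 0.0
    supported = sum(
        1
        for p in plans
        if p.action in road.maneuvers and road.num_lanes >= p.min_lanes
    )
    return supported / len(plans)


def rank_roads(
    roads: list[Road], plans: list[AgentPlan], needs_junction: bool = False
) -> list[tuple[float, str]]:
    """Filter roads by a hard constraint, then sort by descending score (ties by id)."""
    candidates = [r for r in roads if r.has_junction or not needs_junction]
    scored = [(road_score(r, plans), r.road_id) for r in candidates]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored


if __name__ == "__main__":
    roads = [
        Road("r1", 2, True, {"straight", "left_turn"}),
        Road("r2", 1, False, {"straight"}),
        Road("r3", 3, True, {"straight", "left_turn", "lane_change"}),
    ]
    plans = [
        AgentPlan("ego", "left_turn"),
        AgentPlan("npc1", "lane_change", min_lanes=2),
    ]
    print(rank_roads(roads, plans, needs_junction=True))
```

With the example inputs, `r2` is filtered out because it lacks a junction. `r3` supports both plans and scores 1.0, and `r1` supports only the left turn and scores 0.5. Splitting the step into a hard feasibility filter and a soft score mirrors the abstract's distinction between enforcing feasibility and choosing among valid roads. The actual criteria TTSG uses may differ.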