An LLM-driven Scenario Generation Pipeline Using an Extended Scenic DSL for Autonomous Driving Safety Validation

Real-world crash reports, which combine textual summaries and sketches, are valuable for scenario-based testing of autonomous driving systems (ADS). However, current methods cannot effectively translate this multimodal data into precise, executable simulation scenarios, hindering the scalability of ADS safety validation. In this work, we propose a scalable and verifiable pipeline that uses a large language model (GPT-4o mini) and a probabilistic intermediate representation (an Extended Scenic domain-specific language) to automatically extract semantic scenario configurations from crash reports and generate corresponding simulation-ready scenarios. Unlike earlier approaches such as ScenicNL and LCTGen (which generate scenarios directly from text) or TARGET (which uses deterministic mappings from traffic rules), our method introduces an intermediate Scenic DSL layer to separate high-level semantic understanding from low-level scenario rendering, reducing errors and capturing real-world variability. We evaluated the pipeline on cases from the NHTSA CIREN database. The results show high accuracy in knowledge extraction: 100% correctness for environmental and road network attributes, and 97% and 98% for oracle and actor trajectories, respectively, compared to human-derived ground truth. We executed the generated scenarios in the CARLA simulator using the Autoware driving stack, and they consistently triggered the intended traffic-rule violations (such as opposite-lane crossing and red-light running) across 2,000 scenario variations. These findings demonstrate that the proposed pipeline provides a legally grounded, scalable, and verifiable approach to ADS safety validation.
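To make the role of the probabilistic intermediate layer concrete, the following is a minimal Python sketch of the general idea: a semantic configuration is extracted once, and many concrete scenario variations are then sampled from it. It is not the authors' implementation and it is not Scenic syntax. The names `Range`, `ScenarioSpec`, and `sample_variations`, all field names, the numeric ranges, and the choice of uniform sampling are assumptions introduced purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval from which a concrete value is drawn uniformly (assumed distribution)."""
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.low, self.high)


@dataclass(frozen=True)
class ScenarioSpec:
    """Hypothetical stand-in for the semantic configuration extracted from a crash report."""
    weather: str              # environmental attribute (fixed)
    road_type: str            # road network attribute (fixed)
    violation: str            # traffic-rule violation the scenario should trigger
    ego_speed_mps: Range      # uncertain quantity, kept as a range
    actor_speed_mps: Range    # uncertain quantity, kept as a range


def sample_variations(spec: ScenarioSpec, count: int, seed: int = 0) -> list[dict]:
    """Draw `count` concrete scenarios from one spec; fixed attributes are copied unchanged."""
    rng = random.Random(seed)  # seeded so that a given set of variations is reproducible
    return [
        {
            "weather": spec.weather,
            "road_type": spec.road_type,
            "violation": spec.violation,
            "ego_speed_mps": spec.ego_speed_mps.sample(rng),
            "actor_speed_mps": spec.actor_speed_mps.sample(rng),
        }
        for _ in range(count)
    ]


# Illustrative values only; they are not taken from the CIREN cases.
example_spec = ScenarioSpec(
    weather="clear",
    road_type="four_way_intersection",
    violation="red_light_running",
    ego_speed_mps=Range(8.0, 15.0),
    actor_speed_mps=Range(10.0, 20.0),
)
variations = sample_variations(example_spec, count=3, seed=42)
```

Keeping the extracted semantics (`ScenarioSpec`) apart from the sampling step mirrors the separation described in the abstract: the extraction can be checked against human-derived ground truth on its own, while the sampling step supplies the variability needed to produce many variations of one crash.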
