SG-CADVLM: A Context-Aware Decoding Powered Vision Language Model for Safety-Critical Scenario Generation

Safety validation of autonomous vehicles requires testing on safety-critical scenarios, but such events are rare in real-world driving and costly to reproduce because of collision risk. Crash reports provide authentic specifications of safety-critical events and offer an alternative to scarce real-world collision trajectory data, which makes them valuable sources for generating realistic high-risk scenarios in simulation. Existing approaches have significant limitations: data-driven methods lack diversity because they rely on existing latent distributions, whereas adversarial methods often produce unrealistic scenarios that lack physical fidelity. Methods based on Large Language Models (LLMs) and Vision Language Models (VLMs) show promise, but they suffer from context suppression, in which the model's internal parametric knowledge overrides the crash specifications and produces scenarios that deviate from actual accident characteristics. This paper presents SG-CADVLM (A Context-Aware Decoding Powered Vision Language Model for Safety-Critical Scenario Generation), a framework that integrates Context-Aware Decoding with multi-modal input processing to generate safety-critical scenarios from crash reports and road network diagrams. The framework mitigates VLM hallucination while generating road geometry and vehicle trajectories simultaneously. Experimental results show that SG-CADVLM generates critical-risk scenarios at a rate of 84.4%, compared to 12.5% for the baseline methods, a reported improvement of 469%, while producing executable simulations for autonomous vehicle testing.
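The abstract does not spell out how Context-Aware Decoding counters context suppression. As a rough illustration only, the sketch below assumes the commonly used contrastive formulation, in which next-token logits computed with the context (here, the crash report) are amplified relative to logits computed without it: adjusted = (1 + alpha) * with_context - alpha * without_context. The function names, the toy logits, and the value of alpha are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_logits(with_context, without_context, alpha=0.5):
    """Contrast context-conditioned logits against context-free logits.

    Assumed formulation: (1 + alpha) * with_context - alpha * without_context.
    alpha = 0 leaves the context-conditioned logits unchanged.
    """
    if len(with_context) != len(without_context):
        raise ValueError("logit vectors must have the same length")
    return [
        (1 + alpha) * c - alpha * p
        for c, p in zip(with_context, without_context)
    ]


# Toy vocabulary of collision types (hypothetical values).
tokens = ["rear-end", "sideswipe", "head-on"]
prior_only = [2.0, 0.5, 0.1]   # model's parametric prior, no crash report
with_report = [1.5, 1.8, 0.1]  # same model, conditioned on the crash report

plain = softmax(with_report)
adjusted = softmax(context_aware_logits(with_report, prior_only, alpha=1.0))

for token, before, after in zip(tokens, plain, adjusted):
    print(f"{token:10s} plain={before:.3f} context-aware={after:.3f}")
```

In this toy case the report favors "sideswipe" while the model's prior favors "rear-end"; the contrast shifts probability mass toward the token supported by the report. A larger alpha strengthens that shift, at the cost of departing further from the model's original distribution.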
