Vision and Language: Novel Representations and Artificial Intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
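As an illustration of the first use case, the following is a minimal sketch of how an image-text similarity score could be turned into a scalar hazard signal. It is not the authors' implementation. It assumes the image and prompt embeddings have already been produced by a CLIP-style encoder; the function names `cosine` and `hazard_score`, the split into hazard and benign prompts, and the temperature of 0.01 are all hypothetical choices made for this example.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hazard_score(
    image_emb: Sequence[float],
    hazard_embs: Sequence[Sequence[float]],
    benign_embs: Sequence[Sequence[float]],
    temperature: float = 0.01,  # assumed value, not taken from the paper
) -> float:
    """Share of softmax mass that falls on the hazard prompts.

    All embeddings are assumed to come from the same CLIP-style
    encoder pair (image tower for image_emb, text tower for prompts).
    """
    prompts = list(hazard_embs) + list(benign_embs)
    logits = [cosine(image_emb, p) / temperature for p in prompts]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return sum(weights[: len(hazard_embs)]) / sum(weights)


# Toy 2-D vectors standing in for real embeddings.
score = hazard_score([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
is_hazard = score > 0.5  # hypothetical decision threshold
```

Because the score depends only on prompt similarity, no object detector or category list is involved, which matches the category-agnostic framing of the abstract. In practice the prompt set and threshold would need to be tuned on validation data.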

