Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach that leverages CLIP-based image-text similarity to produce a low-latency semantic hazard signal, enabling robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework on the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
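The category-agnostic hazard signal described above can be sketched as a softmax over cosine similarities between a CLIP image embedding and a small bank of hazard-describing versus benign text-prompt embeddings. This is a minimal illustrative sketch, not the paper's implementation: the prompt sets, temperature, and thresholding are assumptions, and the embeddings here are placeholders for those a CLIP encoder would produce.

```python
import numpy as np

def hazard_score(image_emb, hazard_text_embs, benign_text_embs, temperature=0.07):
    """Return the softmax probability mass assigned to hazard prompts.

    image_emb:        (d,) image embedding from a CLIP-style encoder.
    hazard_text_embs: (H, d) embeddings of hazard prompts, e.g. "debris on the road".
    benign_text_embs: (B, d) embeddings of benign prompts, e.g. "a clear road".
    temperature:      softmax temperature (0.07 is an assumed CLIP-like default).
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img = l2norm(np.asarray(image_emb, dtype=np.float64))
    texts = l2norm(np.vstack([hazard_text_embs, benign_text_embs]))

    sims = texts @ img                       # cosine similarities, shape (H + B,)
    logits = sims / temperature
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()

    n_hazard = len(hazard_text_embs)
    return float(probs[:n_hazard].sum())     # mass on hazard prompts, in [0, 1]
```

Because the signal is a single scalar from one image-encoder pass plus precomputed text embeddings, it can run at low latency alongside the main perception stack; a threshold on the score (a design choice not specified in the abstract) would trigger downstream caution.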