Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics such as mIoU, yet we observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and we analyze their relationship to segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, exposing a critical limitation of current VLM-based systems and motivating evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
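
To make the notion of a "perception-realistic corruption" concrete, the sketch below degrades an input frame with mild defocus blur plus additive sensor noise, the kind of degradation that typically lowers mIoU moderately while leaving the scene largely recognizable. This is an illustrative assumption, not the paper's actual corruption pipeline; the severity parameters and the blur/noise combination are placeholders.

```python
# Illustrative sketch of a perception-realistic corruption (not the paper's
# pipeline): mild defocus-like blur followed by additive Gaussian sensor noise.
import numpy as np
from PIL import Image, ImageFilter

def corrupt(image: Image.Image, blur_radius: float = 2.0,
            noise_std: float = 8.0) -> Image.Image:
    """Blur the frame, then add zero-mean Gaussian noise; clip back to uint8."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    arr = np.asarray(blurred).astype(np.float32)
    arr += np.random.normal(0.0, noise_std, size=arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Usage (hypothetical file name):
# degraded = corrupt(Image.open("cityscapes_frame.png"), blur_radius=3.0)
```

The corrupted frame would then be fed to the segmentation model and the VLM in place of the clean input, so that pixel-level and language-level degradation can be measured on the same image.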
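The abstract's language-level misalignment metrics could be operationalized along the following lines. This is a hypothetical sketch: the metric definitions, the set of safety-critical classes, and the keyword-matching scheme are our simplifying assumptions, not the paper's formal definitions.

```python
# Hypothetical sketch of two language-level misalignment metrics:
#   hallucination rate  = critical objects the VLM mentions but that are
#                         absent from the scene, over all mentioned objects;
#   critical omission   = critical objects present in the scene that the
#                         VLM never mentions, over all critical objects present.
# Substring matching against a small class vocabulary is an illustrative
# simplification of mapping free-form text to scene entities.

CRITICAL_CLASSES = {"person", "rider", "car", "bicycle", "traffic light"}

def misalignment_metrics(vlm_text: str, scene_objects: set[str]) -> dict:
    text = vlm_text.lower()
    mentioned = {c for c in CRITICAL_CLASSES if c in text}
    hallucinated = mentioned - scene_objects
    critical_present = scene_objects & CRITICAL_CLASSES
    omitted = critical_present - mentioned
    return {
        "hallucination_rate": len(hallucinated) / max(len(mentioned), 1),
        "critical_omission_rate": len(omitted) / max(len(critical_present), 1),
    }

# Usage: a caption that mentions a bicycle absent from the scene and
# misses the person who is present scores on both metrics.
# misalignment_metrics("A car and a bicycle on the road.", {"car", "person"})
# -> {"hallucination_rate": 0.5, "critical_omission_rate": 0.5}
```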