Vision-Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safe decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means of synthesizing such hazards, it remains challenging to generate well-formulated scenarios that include the moving, intruding, and distant objects frequently observed in the real world. To address this gap, we introduce \textbf{HazardForge}, a scalable pipeline that leverages image editing models, together with layout decision algorithms and validation modules, to generate these scenarios. Using HazardForge, we construct \textbf{MovSafeBench}, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments on MovSafeBench show that VLM performance degrades notably in conditions involving anomalous objects, with the largest drop occurring in scenarios that require nuanced motion understanding.