Reliable environmental perception remains one of the main obstacles to the safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks arising from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection remains underexplored. This paper presents a systematic evaluation of ten representative LVLMs on the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against a classical perception baseline, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision under synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning and geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.