Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution scenarios involving semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that detects anomalous driving scenarios from input images with high accuracy and recall through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting into systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B-parameter open-source model (Qwen2.5VL) to reach 90.8% recall and 93.8% accuracy, surpassing every model evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
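To make the two-phase pipeline concrete, the following is a minimal sketch, assuming a generic `query_vlm` helper standing in for any VLM backend (a local Qwen2.5VL server or a proprietary API); the layer names come from the abstract, but the prompts, output schema, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of SAVANT's two-phase pipeline: (1) per-layer structured
# scene description, (2) multi-modal anomaly evaluation over image + text.
# `query_vlm` is a hypothetical placeholder, not a real library call.

from dataclasses import dataclass

# The four semantic layers named in the abstract.
SEMANTIC_LAYERS = ["Street", "Infrastructure", "Movable Objects", "Environment"]


@dataclass
class SceneDescription:
    layer: str
    description: str


def query_vlm(prompt: str, image_path: str) -> str:
    """Placeholder for a VLM call (e.g., a locally hosted Qwen2.5VL)."""
    raise NotImplementedError("wire this to your VLM backend")


def extract_scene_description(image_path: str) -> list[SceneDescription]:
    """Phase 1: extract a structured description, one query per semantic layer."""
    descriptions = []
    for layer in SEMANTIC_LAYERS:
        prompt = (
            f"Describe only the '{layer}' layer of this driving scene, "
            "listing the relevant objects and their states."
        )
        descriptions.append(SceneDescription(layer, query_vlm(prompt, image_path)))
    return descriptions


def evaluate_anomaly(image_path: str, descriptions: list[SceneDescription]) -> bool:
    """Phase 2: multi-modal evaluation of the image plus the layered text."""
    context = "\n".join(f"[{d.layer}] {d.description}" for d in descriptions)
    verdict = query_vlm(
        "Given the layered scene description below, is this driving scene "
        f"semantically anomalous? Answer YES or NO.\n{context}",
        image_path,
    )
    return verdict.strip().upper().startswith("YES")
```

In this reading, the structured per-layer prompts are what replace ad-hoc prompting: the evaluator sees both the raw image and a consistent, layered text description rather than a single free-form query.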