Autonomous driving systems struggle with unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models generalize poorly to these rare events because of limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. Using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset show that INSIGHT makes hazard prediction substantially more straightforward and more accurate than existing models, with a notable increase in generalization performance. This advancement improves the robustness and safety of autonomous driving systems, supporting better situational awareness and, potentially, better decision-making in complex real-world scenarios.