The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset using the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single-scout model. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to cloud-based pipelines.
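As a rough illustration of the "Judge-Scout" consensus described above, the minimal Python sketch below aggregates per-frame reports from multiple scout models and filters out single-model hallucinations. All class names, fields, and the voting/median rule are hypothetical; in the actual system the judge is itself a reasoning VLM rather than a fixed aggregation rule.

```python
from dataclasses import dataclass
from collections import Counter
from statistics import median
from typing import List


@dataclass
class ScoutReport:
    """Output of one 'scout' VLM for a single frame (hypothetical schema)."""
    model_name: str
    event_tags: List[str]   # open-vocabulary event labels, e.g. "jaywalking"
    risk_score: float       # scalar risk assessment in [0, 1]


def judge_consensus(reports: List[ScoutReport], min_votes: int = 2) -> dict:
    """Toy stand-in for the 'Judge' stage: keep only event tags that at least
    `min_votes` scouts agree on, and take the median risk score so that a
    single hallucinating scout cannot dominate the assessment."""
    tag_counts = Counter(tag for r in reports for tag in set(r.event_tags))
    agreed_tags = [t for t, c in tag_counts.items() if c >= min_votes]
    risk = median(r.risk_score for r in reports) if reports else 0.0
    return {"events": agreed_tags, "risk": round(risk, 3)}


if __name__ == "__main__":
    reports = [
        ScoutReport("scout-a", ["jaywalking", "construction"], 0.72),
        ScoutReport("scout-b", ["jaywalking"], 0.65),
        ScoutReport("scout-c", ["jaywalking", "debris"], 0.90),
    ]
    print(judge_consensus(reports))
    # -> {'events': ['jaywalking'], 'risk': 0.72}
```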