Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that uses no LiDAR or radar and emits three complementary mining signals from a frozen vision-language backbone in a single forward pass: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score. A motion forecast is produced as a byproduct and is not claimed as a contribution. Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts the shared activation stream and degrades sibling heads even though their weights are frozen, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The outputs of the mining heads are thereby preserved bit-identically while only ~102k parameters are trained. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, which we validate qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
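To make the interference mechanism concrete, the following toy sketch contrasts a randomly initialized residual module with a zero-initialized one placed on a shared activation stream that a frozen sibling head reads. It is an illustration under stated assumptions, not the SceneMiner implementation: the low-rank residual form, the dimensions, and every name (`adapter_stream`, `W_tag`, `down`, `up_zero`, `up_random`) are hypothetical. It shows only the initialization-time half of the idea; the second condition, freezing every parameter that feeds the shared stream, is not modeled here.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix with small Gaussian entries."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    """All-zero matrix, used for the zero-initialized projection."""
    return [[0.0] * cols for _ in range(rows)]


def adapter_stream(x, down, up):
    """Shared stream after a hypothetical residual module: x + up @ (down @ x)."""
    delta = matvec(up, matvec(down, x))
    return [xi + di for xi, di in zip(x, delta)]


rng = random.Random(0)
dim, rank, n_tags = 8, 2, 3          # toy sizes, chosen for illustration only

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # shared activation
W_tag = rand_matrix(n_tags, dim, rng)          # frozen sibling head (never updated)
baseline = matvec(W_tag, x)                    # sibling output before any change

down = rand_matrix(rank, dim, rng)             # new module, input projection

# Naive upgrade: a randomly initialized output projection shifts the stream,
# so the frozen head's output changes although W_tag itself is untouched.
up_random = rand_matrix(dim, rank, rng)
naive = matvec(W_tag, adapter_stream(x, down, up_random))

# Identity-preserving upgrade: a zero-initialized output projection adds an
# exact zero to the stream, so the frozen head's output is unchanged.
up_zero = zeros(dim, rank)
preserved = matvec(W_tag, adapter_stream(x, down, up_zero))

print("naive upgrade keeps sibling output:    ", naive == baseline)
print("zero-init upgrade keeps sibling output:", preserved == baseline)
```

The equality check on the zero-initialized path is exact rather than approximate, because adding a floating-point zero leaves each activation unchanged. This is the sense in which an output can be preserved bit-identically.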