To navigate complex traffic environments, self-driving vehicles must recognize many semantic classes, including vulnerable road users and traffic control devices. However, many safety-critical objects (e.g., construction workers) appear infrequently in nominal traffic conditions, leaving driving data alone with a severe shortage of training examples. Recent vision foundation models, trained on large corpora of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm: it first generates proposals with a LiDAR-based branch and a novel camera-based branch, then refines them with attention, attending in particular to image features from OWLv2. Evaluations on real-world driving data show that rich priors from vision foundation models, combined with careful multi-modal fusion designs, lead to large gains for long-tailed 3D detection. The project website is at https://waabi.ai/fomo3d/.