Multi-view 3D visual grounding is critical for autonomous driving vehicles to interpret natural language and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. We present a Hierarchy of Grounding (HoG) method to construct NuGrounding, generating hierarchical multi-level instructions that comprehensively cover human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that seamlessly combines the instruction-comprehension abilities of multi-modal LLMs (MLLMs) with the precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder that refines spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method significantly outperforms baselines adapted from representative 3D scene understanding methods, achieving 0.59 precision and 0.64 recall, improvements of 50.8% and 54.7%, respectively.
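The aggregation scheme described above, where a context query pools semantic instruction features and decoupled task tokens then attend to 3D geometric features in a fusion step, can be sketched with plain single-head cross-attention. This is a minimal illustrative sketch only: the tensor shapes, the number of tokens, and the single attention step are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # single-head scaled dot-product attention (illustrative stand-in
    # for the learned attention layers in the real model)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 64
geo_feats = rng.standard_normal((100, d))  # stand-in for 3D geometric features
sem_feats = rng.standard_normal((20, d))   # stand-in for MLLM instruction embeddings

# two decoupled task tokens and one context query
# (these would be learned parameters in the real model)
task_tokens = rng.standard_normal((2, d))
context_query = rng.standard_normal((1, d))

# the context query aggregates semantic instruction features
context = cross_attention(context_query, sem_feats, sem_feats)

# fusion-decoder step: task tokens, conditioned on the aggregated
# context, attend to the geometric features for localization
fused = cross_attention(task_tokens + context, geo_feats, geo_feats)
print(fused.shape)  # (2, 64): one fused feature vector per task token
```

The key design choice mirrored here is the decoupling: semantic aggregation (context query over instruction features) and spatial grounding (task tokens over geometric features) happen in separate attention steps, so linguistic comprehension and 3D localization are fused rather than entangled in one pass.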