AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

3D Visual Grounding (3DVG) is an essential capability for embodied AI: an agent must localize objects in a 3D scene from natural language descriptions. Recent zero-shot methods leverage 2D large vision-language models (LVLMs), but they often rely on pre-existing sets of multi-view images and struggle with the limited semantic and spatial detail provided by standard 3D segmentation tools. We present $\textbf{AgentGrounder}$, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) containing instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only the relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by keeping irrelevant objects out of the prompt. We evaluate on ScanRefer and Nr3D in a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% [email protected] on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
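
To make the two-stage design concrete, the following is a minimal sketch of an Object Lookup Table and a selective retrieval step followed by a simple geometric score. It is an illustration under stated assumptions, not the implementation in the AgentGrounder repository: the names `OLTEntry`, `retrieve`, and `nearest_to_anchor` are hypothetical, exact label matching stands in for the agent's query decomposition, and centre-to-centre distance stands in for whatever geometric scoring the method actually uses.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class OLTEntry:
    """One row of a hypothetical Object Lookup Table (OLT)."""
    instance_id: int
    label: str
    box_min: Vec3  # axis-aligned bounding box, minimum corner
    box_max: Vec3  # axis-aligned bounding box, maximum corner

    @property
    def center(self) -> Vec3:
        # Midpoint of the box along each axis.
        return tuple((lo + hi) / 2 for lo, hi in zip(self.box_min, self.box_max))


def retrieve(olt: List[OLTEntry], label: str) -> List[OLTEntry]:
    """Return only the entries whose semantic label matches (assumed exact match)."""
    return [entry for entry in olt if entry.label == label]


def nearest_to_anchor(targets: List[OLTEntry], anchors: List[OLTEntry]) -> OLTEntry:
    """Pick the target whose box centre is closest to any anchor centre.

    Assumes both lists are non-empty; min() raises ValueError otherwise.
    """
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, a.center) for a in anchors),
    )


# Offline stage (assumed output): a toy OLT with made-up boxes, in metres.
olt = [
    OLTEntry(0, "chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    OLTEntry(1, "chair", (3.0, 0.0, 0.0), (3.5, 0.5, 1.0)),
    OLTEntry(2, "table", (3.6, 0.0, 0.0), (4.6, 1.0, 0.8)),
    OLTEntry(3, "lamp", (0.0, 2.0, 0.0), (0.2, 2.2, 1.5)),
]

# Online stage for the query "the chair next to the table":
# only chairs and tables are retrieved; the lamp never reaches the scorer.
targets = retrieve(olt, "chair")
anchors = retrieve(olt, "table")
best = nearest_to_anchor(targets, anchors)
print(best.instance_id)
```

In this toy scene the lamp is never passed to the scoring step, which mirrors the context-window argument above: only candidates relevant to the query are considered. A query that depends on color or material could not be resolved from this table alone, which is where on-demand rendering would come in.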
