Robots collaborating with humans must convert natural-language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision-language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physical space. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes a language query into structured subcomponents and queries a VLM to ground each one. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
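To make the idea of probabilistic composition concrete, the following is a minimal illustrative sketch, not MAPG's actual implementation. It assumes the query "go two meters to the right of the fridge" has already been decomposed into an anchor ("fridge"), a direction ("right"), and a distance (2 m), and that grounding the anchor yielded a few weighted location hypotheses. The names `AnchorHypothesis`, `goal_density`, and `most_likely_goal`, the 2D simplification, the Gaussian noise model, and the grid search are all assumptions introduced here for illustration.

```python
import math
from typing import NamedTuple


class AnchorHypothesis(NamedTuple):
    """One candidate location for the referenced object (hypothetical type)."""
    x: float     # metres, map frame
    y: float     # metres, map frame
    prob: float  # grounding confidence; hypotheses are assumed to sum to 1


def goal_density(px, py, anchors, direction, distance, sigma=0.3):
    """Unnormalised goal density at (px, py).

    Each anchor hypothesis is shifted by `distance` along `direction`, and the
    result is modelled as an isotropic Gaussian with std `sigma` (an assumed
    noise model). The mixture is weighted by grounding confidence.
    """
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    total = 0.0
    for a in anchors:
        tx, ty = a.x + distance * ux, a.y + distance * uy
        sq_dist = (px - tx) ** 2 + (py - ty) ** 2
        total += a.prob * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total


def most_likely_goal(anchors, direction, distance, bounds, step=0.1, sigma=0.3):
    """Grid-search the map bounds (xmin, xmax, ymin, ymax) for the densest cell."""
    xmin, xmax, ymin, ymax = bounds
    nx = int(round((xmax - xmin) / step)) + 1
    ny = int(round((ymax - ymin) / step)) + 1
    best, best_score = None, -1.0
    for i in range(nx):
        for j in range(ny):
            px, py = xmin + i * step, ymin + j * step
            score = goal_density(px, py, anchors, direction, distance, sigma)
            if score > best_score:
                best, best_score = (px, py), score
    return best, best_score


# Two hypothetical fridge detections with different confidences.
fridge = [AnchorHypothesis(1.0, 1.0, 0.8), AnchorHypothesis(4.0, 3.0, 0.2)]
# "Right" is taken to be +x in the map frame, which is an assumption; in
# practice the direction depends on the chosen reference frame or viewpoint.
goal, score = most_likely_goal(fridge, (1.0, 0.0), 2.0, (0.0, 8.0, 0.0, 5.0))
print("goal:", (round(goal[0], 2), round(goal[1], 2)), "score:", round(score, 3))
```

Keeping every anchor hypothesis in the mixture, rather than committing to the single most confident detection before applying the metric offset, lets uncertainty in the semantic grounding carry through to the final goal estimate.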