Embodied navigation is a fundamental capability for robotic agents operating in real-world environments. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation