The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy while using a quarter as many words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: https://bo-miao.github.io/LangMap