Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap