Humans naturally form semantic associations with the objects around them. This lets them build a mental map of the environment and navigate on demand when given linguistic instructions. A natural goal in Vision-Language Navigation (VLN) research is to endow autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, these representations have limited practical applicability because they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into the spatial map representation using a community detection algorithm, and by utilizing the word ontology learned by large language models (LLMs) to perform open-set semantic associations in the map representation. Compared to the baseline, the resulting map representation improves navigation performance by two-fold (233%) on realistic language commands with instance-specific descriptions. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.