Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing it requires robots to infer implicit semantics and explore large task spaces efficiently. However, existing methods, ranging from end-to-end learning to modular architectures built on foundation models, often cannot decompose complex tasks or explore efficiently, which leads to aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that uses a symbolic 3D scene graph and an image memory system to strengthen the neural reasoning of vision-language models (VLMs) for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with a symbolic heuristic function to gather task-related information efficiently while minimizing redundant travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. In real-world experiments, VL-Nav achieved an 86.3% SR, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.
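To make the idea of coupling neural semantic cues with a symbolic heuristic more concrete, the following is a minimal illustrative sketch, not VL-Nav's actual scoring function. It assumes each exploration candidate carries a semantic relevance value in [0, 1] (for example, from a VLM), and that the heuristic is a simple weighted sum penalizing travel distance and revisits. The names `Candidate`, `score`, `select_goal`, and the weights `w_sem`, `w_dist`, `w_visit` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical exploration goal (e.g. a frontier point)."""
    x: float
    y: float
    semantic: float  # assumed neural cue in [0, 1]: relevance to the instruction
    visits: int      # assumed count of earlier passes near this point


def score(c, robot_xy, w_sem=1.0, w_dist=0.1, w_visit=0.5):
    """Illustrative heuristic: reward semantic relevance, penalize
    travel distance and repeated visits. Weights are arbitrary."""
    dist = math.hypot(c.x - robot_xy[0], c.y - robot_xy[1])
    return w_sem * c.semantic - w_dist * dist - w_visit * c.visits


def select_goal(candidates, robot_xy):
    """Pick the candidate with the highest heuristic score."""
    return max(candidates, key=lambda c: score(c, robot_xy))


if __name__ == "__main__":
    robot = (0.0, 0.0)
    candidates = [
        Candidate(1.0, 0.0, semantic=0.2, visits=0),  # near but irrelevant
        Candidate(5.0, 0.0, semantic=0.9, visits=0),  # farther but relevant
        Candidate(1.0, 1.0, semantic=0.9, visits=2),  # relevant but already revisited
    ]
    goal = select_goal(candidates, robot)
    print(goal)
```

Under these assumed weights, the farther but semantically relevant and unvisited candidate is preferred over both the nearby irrelevant one and the relevant one that has already been visited twice, which is the kind of trade-off the abstract describes.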