Object navigation (ObjectNav) requires a robot to search for an unobserved target in an unknown environment, deciding where to explore next under partial observability. Effective search resembles human exploration: it selectively probes visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. To infer high-level search intent from low-level human actions, we introduce Frontier-based Human-Intent Labeling, which looks ahead in each demonstration and labels the frontier that best explains the demonstrator's future search direction. We construct a spatial-visual candidate space in which a bird's-eye-view (BEV) memory tracks explored regions, unexplored frontiers, and trajectory history, while an egocentric visual memory provides semantic cues for each candidate. A vision-language model (VLM) policy is trained to select among these grounded candidates, using an Intent-Aligned Objective to encourage consistent, human-like exploration. IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1, and HM3D-v2 ObjectNav benchmarks. The proposed candidate-level navigation interface transfers zero-shot to wheeled, quadruped, and humanoid robots without further VLM fine-tuning. \href{https://anonymous.4open.science/w/IntentNav/}{Project page}.
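As a rough illustration of the idea behind Frontier-based Human-Intent Labeling, the sketch below looks ahead along a demonstration trajectory and picks the candidate frontier whose direction best matches the demonstrator's future displacement. This is a minimal sketch under our own assumptions, not the method as implemented in IntentNav: the function name `label_intent_frontier`, the 2D point representation, the fixed `lookahead` window, and the cosine-similarity score are all hypothetical choices made for illustration.

```python
import math


def label_intent_frontier(trajectory, t, frontiers, lookahead=10):
    """Return the index of the frontier best aligned with future motion.

    Hypothetical illustration only. `trajectory` is a list of (x, y)
    demonstrator positions, `t` is the current step, and `frontiers` is a
    list of (x, y) candidate frontier centers. Returns None when no label
    can be assigned (no frontiers, or no movement within the window).
    """
    cx, cy = trajectory[t]
    # Look ahead a fixed number of steps, clamped to the end of the demo.
    fx, fy = trajectory[min(t + lookahead, len(trajectory) - 1)]
    hx, hy = fx - cx, fy - cy
    h_norm = math.hypot(hx, hy)
    if h_norm == 0 or not frontiers:
        return None

    best_idx, best_score = None, -math.inf
    for i, (px, py) in enumerate(frontiers):
        dx, dy = px - cx, py - cy
        d_norm = math.hypot(dx, dy)
        if d_norm == 0:
            continue  # frontier coincides with the current position
        # Cosine similarity between future heading and frontier direction.
        score = (hx * dx + hy * dy) / (h_norm * d_norm)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


if __name__ == "__main__":
    demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    candidates = [(0.0, 5.0), (5.0, 0.5), (-5.0, 0.0)]
    print(label_intent_frontier(demo, 0, candidates, lookahead=3))
```

In this toy case the demonstrator moves along the positive x-axis, so the second candidate (index 1) is selected. A practical labeler would likely also account for path distance, visibility, and frontiers that appear or vanish as the map grows; those details are not specified in the abstract and are omitted here.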