ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry, both of which are critical for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves state-of-the-art performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.
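The control flow described in the abstract (imagine candidate views, let a selector pick the best one, then hand the chosen viewpoint to a point-goal controller) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Candidate`, `imagine_candidates`, `select_best_view`, `navigate`, `toy_score`) is hypothetical, the imagined "images" are placeholder strings, the VLM query is replaced by a user-supplied scoring function, the memory is a plain bounded queue rather than the selective foveation mechanism, and the point-goal controller is assumed to reach each waypoint exactly.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading in radians)


@dataclass(frozen=True)
class Candidate:
    waypoint: Pose       # pose the robot would move to
    imagined_view: str   # placeholder for an imagined future image


Scorer = Callable[[Candidate, Sequence[str], str], float]


def imagine_candidates(pose: Pose, n: int = 4, step: float = 1.0) -> List[Candidate]:
    """Hypothetical stand-in for a future-view imagination module.

    Spreads n waypoints evenly around the robot. A real module would predict
    promising waypoints and generate the images expected at each of them.
    """
    x, y, heading = pose
    candidates = []
    for k in range(n):
        theta = heading + 2.0 * math.pi * k / n
        wp = (x + step * math.cos(theta), y + step * math.sin(theta), theta)
        candidates.append(Candidate(wp, f"view@({wp[0]:.1f},{wp[1]:.1f})"))
    return candidates


def select_best_view(candidates: Sequence[Candidate], memory: Sequence[str],
                     goal: str, score: Scorer) -> Candidate:
    """Stand-in for the VLM query: return the highest-scoring imagined view."""
    return max(candidates, key=lambda c: score(c, memory, goal))


def navigate(start: Pose, goal: str, score: Scorer,
             max_steps: int = 5, memory_size: int = 4) -> List[Pose]:
    """Run the imagine / select / move loop and return the visited poses."""
    memory = deque(maxlen=memory_size)  # simple bounded keyframe store
    pose = start
    trajectory = [pose]
    for _ in range(max_steps):
        memory.append(f"obs@({pose[0]:.1f},{pose[1]:.1f})")
        candidates = imagine_candidates(pose)
        best = select_best_view(candidates, list(memory), goal, score)
        pose = best.waypoint  # assumption: point-goal controller succeeds
        trajectory.append(pose)
    return trajectory


def toy_score(c: Candidate, memory: Sequence[str], goal: str) -> float:
    """Toy scorer: prefer waypoints closer to a hidden target at (3, 0)."""
    return -math.hypot(c.waypoint[0] - 3.0, c.waypoint[1])


path = navigate((0.0, 0.0, 0.0), "chair", toy_score, max_steps=3)
```

With the toy scorer, each step picks the waypoint that moves one unit toward the hidden target, so the goal-oriented task reduces to three consecutive point-goal moves. Swapping `toy_score` for a call to an actual VLM, and the string placeholders for generated images, is where a real system would differ.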
