PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but interfacing them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as a sparse decision problem grounded in raw observations. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.

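To make the PoI interface described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a PoI can be represented as a record pairing an executable waypoint with a handle to a raw egocentric observation, and that the VLM acts as a function returning the index of the chosen PoI. The names PoIKind, PoI, select_poi, and prefer_suspected_target are hypothetical, and the selector shown is a simple stand-in for a real VLM call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence, Tuple


class PoIKind(Enum):
    """PoI categories named in the abstract."""
    FRONTIER = "frontier"
    SUSPECTED_TARGET = "suspected_target"
    STAIRS = "stairs"
    FLOOR_SUMMARY = "floor_summary"


@dataclass(frozen=True)
class PoI:
    """Hypothetical record: one sparse decision unit."""
    kind: PoIKind
    # Assumed (x, y, z) goal in the map frame that a low-level planner can reach.
    waypoint: Tuple[float, float, float]
    # Assumed handle to the raw egocentric image backing this PoI.
    observation_id: str


# A selector stands in for the VLM: it sees the goal and the candidate
# PoIs and returns the index of the one to pursue next.
Selector = Callable[[str, Sequence[PoI]], int]


def select_poi(goal: str, pois: Sequence[PoI], selector: Selector) -> PoI:
    """Ask the selector for one PoI and validate its answer."""
    if not pois:
        raise ValueError("no PoIs available to select from")
    index = selector(goal, pois)
    if not 0 <= index < len(pois):
        raise ValueError(f"selector returned out-of-range index {index}")
    return pois[index]


def prefer_suspected_target(goal: str, pois: Sequence[PoI]) -> int:
    """Placeholder policy (not a VLM): pick the first suspected target,
    otherwise fall back to the first candidate."""
    for i, poi in enumerate(pois):
        if poi.kind is PoIKind.SUSPECTED_TARGET:
            return i
    return 0


candidates = [
    PoI(PoIKind.FRONTIER, (1.0, 2.0, 0.0), "img_0"),
    PoI(PoIKind.SUSPECTED_TARGET, (3.0, 0.5, 0.0), "img_1"),
]
chosen = select_poi("chair", candidates, prefer_suspected_target)
```

Because the decision reduces to a single index over a finite candidate set, it can be checked automatically against a reference choice, which is one plausible reading of why the abstract calls such decisions verifiable. The chosen waypoint would then be handed to a low-level planner, which is outside the scope of this sketch.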