Foresight: Iterative Reasoning About Clues that Matter for Navigation

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works either rely on known navigation factors and closed-set factor categories, or identify cues before motion planning and therefore miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but must be adapted to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight
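The abstract describes the propose-critique-refine loop only at a high level. The following is a minimal Python sketch of such a loop under our own assumptions, not the Foresight implementation: `propose` and `critique` are hypothetical callables standing in for the finetuned VLM, a plan is a list of image-space pixel waypoints, each new proposal receives all prior critiques, and refinement stops when a critique marks the plan acceptable or after a fixed budget (`max_rounds`, also an assumption). The toy proposer and critic exist only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Point = Tuple[float, float]  # (u, v) pixel coordinate in the camera image


@dataclass
class Critique:
    acceptable: bool  # hypothetical flag: does the critic accept this plan?
    text: str         # natural-language feedback passed to the next proposal


@dataclass
class RefinementTrace:
    plans: List[List[Point]] = field(default_factory=list)
    critiques: List[Critique] = field(default_factory=list)


def refine_plan(
    image: Any,
    goal: str,
    propose: Callable[[Any, str, List[Critique]], List[Point]],
    critique: Callable[[Any, str, List[Point]], Critique],
    max_rounds: int = 3,
) -> Tuple[List[Point], RefinementTrace]:
    """Alternate proposing and critiquing image-space plans before execution."""
    trace = RefinementTrace()
    plan: List[Point] = []
    for _ in range(max_rounds):
        # Each proposal is conditioned on every critique produced so far.
        plan = propose(image, goal, list(trace.critiques))
        feedback = critique(image, goal, plan)
        trace.plans.append(plan)
        trace.critiques.append(feedback)
        if feedback.acceptable:
            break
    # If the budget runs out, the last proposal is returned unaccepted.
    return plan, trace


def toy_propose(image: Any, goal: str, critiques: List[Critique]) -> List[Point]:
    # Hypothetical stand-in for the VLM proposer: shift the path right once per critique.
    shift = 100.0 * len(critiques)
    return [(200.0 + shift * t / 4, 400.0 - 50.0 * t) for t in range(5)]


def toy_critique(image: Any, goal: str, plan: List[Point]) -> Critique:
    # Hypothetical stand-in for the VLM critic: require the plan to end at the ramp (u >= 300).
    u_end, _ = plan[-1]
    if u_end < 300.0:
        return Critique(False, "Plan ends left of the ramp; shift toward the ramp on the right.")
    return Critique(True, "Plan reaches the ramp.")


plan, trace = refine_plan(None, "take the ramp to the entrance", toy_propose, toy_critique)
print(len(trace.plans), plan[-1])  # 2 (300.0, 200.0)
```

Passing the full critique history, rather than only the latest critique, is one plausible reading of "conditioned on prior critiques"; the stopping rule and round budget are likewise illustrative choices.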

翻译：从稀疏语言指令进行开放世界无地图导航需要解析未明确定义的目标，并推断哪些环境线索与达成目标相关。例如，到达视野外的目的地可能需要解读指示行进方向或路径的坡道、标志或绕行路线。现有方法受限于依赖已知导航因子和封闭式因子类别，或在运动规划前识别线索而遗漏依赖规划的线索。我们论证预训练的视觉语言模型能够发现新的指令相关线索，但需通过适配来聚焦哪些线索重要及其应如何影响运动规划。我们通过Foresight框架实现这些理念，这是一个测试时框架，其中微调后的VLM交替提出图像空间的运动规划，并使用语言目标与视觉上下文对其进行批判。后续规划基于先前批判进行调整，从而在执行前实现迭代式运动优化。为对齐开放行为偏好的规划批判与改进，我们从人类反馈中学习奖励模型，并利用强化学习在规划-批判循环中对VLM进行后训练。在离线评估和6个真实世界环境中，Foresight相比最先进的测试时推理与基础模型基线，平均任务成功率提升37%，每次任务干预次数减少52%，且能在Jetson AGX Orin上实时运行。我们将发布代码、数据及训练细节，以支持机器人运动优化的测试时推理相关后续研究。更多视频请访问：https://amrl.cs.utexas.edu/foresight