Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
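The two mechanisms described above, switching between object-keypoint and hand-pose references and closed-loop monitoring with recovery from single-step failures, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not NovaPlan's actual implementation: the `Reference` type, its scalar `confidence` score, the `choose_reference` and `run_task` functions, and the `execute` callback are hypothetical names, and a bounded retry stands in for VLM re-planning.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Reference:
    """Hypothetical motion reference extracted from a generated video."""
    source: str        # "object_keypoints" or "hand_pose"
    confidence: float  # assumed reliability score in [0, 1]


def choose_reference(object_ref: Optional[Reference],
                     hand_ref: Optional[Reference]) -> Optional[Reference]:
    """Return the candidate with the higher confidence.

    Ties go to the object reference (an arbitrary choice in this sketch).
    Returns None when neither candidate is available.
    """
    candidates = [r for r in (object_ref, hand_ref) if r is not None]
    if not candidates:
        return None
    # max() keeps the first maximal element, so the object reference wins ties.
    return max(candidates, key=lambda r: r.confidence)


def run_task(subgoals: List[str],
             execute: Callable[[str], bool],
             max_retries: int = 2) -> List[str]:
    """Run sub-goals in order and retry a failed sub-goal.

    `execute` is a hypothetical callback that performs one sub-goal and
    reports whether the monitor judged it successful. Retrying stands in
    for re-planning. Returns a log of outcomes.
    """
    log: List[str] = []
    for goal in subgoals:
        for _ in range(max_retries + 1):
            ok = execute(goal)
            log.append(f"{goal}:{'ok' if ok else 'fail'}")
            if ok:
                break
        else:
            # Every attempt failed: stop the task instead of continuing.
            log.append(f"{goal}:abort")
            break
    return log
```

In a real system the confidence score would have to come from signals such as occlusion or depth quality, which the abstract names as failure modes but does not quantify; the scalar used here is only a placeholder for that decision.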