DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.
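
The abstract does not state the ORPO objective or how preferred and rejected plans are selected. As a minimal sketch only, assume the objective resembles the commonly published odds-ratio formulation: a supervised term on the preferred plan plus a weighted odds-ratio penalty against the rejected plan. Also assume the preference pairs come from ranking candidate plans by their outcomes in the video world model. The names `orpo_style_loss`, `chosen_logps`, `rejected_logps`, and `lam` are hypothetical and are not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_style_loss(chosen_logps, rejected_logps, lam=0.1):
    """Odds-ratio style loss for one (preferred, rejected) plan pair.

    chosen_logps / rejected_logps: per-token log-probabilities that the
    planner assigns to each plan (hypothetical inputs). In this sketch the
    preferred plan is assumed to be the one whose imagined rollout scored
    better under the world model.
    """
    avg_w = sum(chosen_logps) / len(chosen_logps)
    avg_l = sum(rejected_logps) / len(rejected_logps)

    # Supervised term: negative mean log-likelihood of the preferred plan.
    sft_term = -avg_w

    # Odds-ratio term: -log(sigmoid(r)) = softplus(-r), where r is the
    # log odds ratio between the preferred and the rejected plan.
    log_odds_ratio = log_odds(avg_w) - log_odds(avg_l)
    or_term = softplus(-log_odds_ratio)

    return sft_term + lam * or_term


if __name__ == "__main__":
    # A plan the planner already favours yields a lower loss when it is
    # also the preferred one than when it is the rejected one.
    print(orpo_style_loss([-0.1, -0.2], [-1.5, -2.0]))
    print(orpo_style_loss([-1.5, -2.0], [-0.1, -0.2]))
```

Under these assumptions no separate reference model or learned value function is needed; the only training signal is the pairwise ranking of imagined rollouts.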
