Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
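The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`DistillExample`, `build_distillation_set`, `score_samples`, and the `toy_*` stand-ins) is hypothetical, strings stand in for images and videos, and the actual model training and policy update steps are omitted. It only shows which inputs each component sees: the Demonstrator receives the detailed solution, while the Executor is paired with the short task prompt alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DistillExample:
    image: str         # stand-in for an unlabeled scene image
    task_prompt: str   # short instruction the Executor is conditioned on
    target_video: str  # Demonstrator output used as the distillation target


def build_distillation_set(
    images: List[str],
    propose_task: Callable[[str], Tuple[str, str]],
    demonstrator: Callable[[str, str], str],
) -> List[DistillExample]:
    """Pair each image with a short task and a Demonstrator video."""
    examples = []
    for image in images:
        # The VLM proposes a task plus a detailed step-by-step solution.
        task, solution = propose_task(image)
        # Only the Demonstrator sees the detailed solution text.
        video = demonstrator(image, solution)
        # The Executor's training pair keeps the short task prompt only.
        examples.append(DistillExample(image, task, video))
    return examples


def score_samples(
    image: str,
    task: str,
    executor_sample: Callable[[str, str, int], str],
    judge: Callable[[str, str], float],
    num_samples: int = 4,
) -> List[Tuple[str, float]]:
    """Sample Executor videos and attach a judge reward to each one."""
    videos = [executor_sample(image, task, seed) for seed in range(num_samples)]
    return [(video, judge(video, task)) for video in videos]


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_propose_task(image: str) -> Tuple[str, str]:
    return f"tidy {image}", f"step 1: approach {image}; step 2: tidy it"


def toy_demonstrator(image: str, solution: str) -> str:
    return f"video[{image} | {solution}]"


def toy_executor_sample(image: str, task: str, seed: int) -> str:
    return f"video[{image} | {task} | seed={seed}]"


def toy_judge(video: str, task: str) -> float:
    # A real judge would be a VLM; this one only checks a substring.
    return 1.0 if task in video else 0.0
```

In a real system the scored samples would feed a reinforcement learning update of the Executor; the choice of algorithm is not specified in the abstract and is left out here.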