Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast, predictive, video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Code and models will be released.
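The three-stage pipeline described above can be sketched in code. The following is a minimal, hypothetical illustration, not the paper's actual implementation: `predict_future_frames`, `few_step_generate`, and `action_model` are stand-ins for the adapted video model, the adversarially distilled few-step generator, and the action head, respectively, and all shapes and fusion rules are placeholder assumptions.

```python
# Hypothetical sketch of the pipeline in the abstract:
# (1) a video model predicts future frames, (2) a distilled generator
# produces the same frames in only a few refinement steps, (3) an action
# model fuses the generated frames with the real observation to correct
# spatial errors. All names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def predict_future_frames(obs, horizon=8):
    """Stand-in for the adapted video generation model: returns
    `horizon` future frames conditioned on the current observation."""
    return np.stack([obs + 0.01 * (t + 1) for t in range(horizon)])

def few_step_generate(obs, steps=4, horizon=8):
    """Stand-in for the adversarially distilled generator: same output
    contract, but produced in only `steps` refinement passes instead of
    a long iterative denoising schedule."""
    frames = np.zeros((horizon, *obs.shape))
    for _ in range(steps):  # few refinement passes
        frames = 0.5 * frames + 0.5 * predict_future_frames(obs, horizon)
    return frames

def action_model(generated_frames, real_obs):
    """Stand-in for the action head: reads goal intent from the
    predicted frames and uses the real observation to correct spatial
    drift, emitting a 7-DoF action vector."""
    intent = generated_frames.mean(axis=(0, 2, 3))    # coarse goal signal
    correction = real_obs.mean(axis=(1, 2)) - intent  # spatial correction
    return np.concatenate([intent + correction, [0.0, 0.0, 0.0, 1.0]])[:7]

obs = rng.standard_normal((3, 64, 64))  # dummy RGB observation
frames = few_step_generate(obs)         # fast few-step video prediction
action = action_model(frames, obs)      # video-conditioned action
```

The key design point this sketch mirrors is that the action model conditions on both the generated video (for intent over the prediction horizon) and the real observation (to correct spatial errors in the generated frames).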