Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches tend to be slow and struggle to capture physically consistent interactions over long horizons, which limits their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach uses consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. The framework enables scalable demonstration collection entirely within the world models, and the collected demonstrations are used to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. We also evaluate policies both within the world models and in the real world, and observe a strong correlation between simulated and real-world performance. Together, these results establish Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
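To make the reported horizon concrete, the following is a minimal sketch of the arithmetic behind "more than 10 minutes at 15 FPS", together with a generic action-conditioned latent rollout loop. It is not the authors' implementation: the function names `rollout_steps`, `step_budget_ms`, and `rollout`, and the `dynamics(z, a)` / `decode(z)` interfaces, are hypothetical stand-ins chosen for illustration. Only the frame rate and the 10-minute lower bound come from the abstract.

```python
from typing import Callable, List, Sequence

FPS = 15               # frame rate reported in the abstract
HORIZON_MINUTES = 10   # lower bound on the stable horizon reported in the abstract


def rollout_steps(minutes: float = HORIZON_MINUTES, fps: int = FPS) -> int:
    """Number of autoregressive prediction steps in a rollout of this length."""
    return int(minutes * 60 * fps)


def step_budget_ms(fps: int = FPS) -> float:
    """Wall-clock budget per predicted frame needed to keep up with real time."""
    return 1000.0 / fps


def rollout(z0, actions: Sequence, dynamics: Callable, decode: Callable) -> List:
    """Generic action-conditioned latent rollout (hypothetical interface).

    dynamics(z, a) returns the next latent state and decode(z) returns a frame.
    Each predicted latent is fed back as the next input, which is why
    prediction errors can compound over long horizons.
    """
    frames = []
    z = z0
    for a in actions:
        z = dynamics(z, a)        # predict the next latent from state and action
        frames.append(decode(z))  # map the latent back to an image-like output
    return frames


if __name__ == "__main__":
    print(rollout_steps())              # 9000
    print(round(step_budget_ms(), 1))   # 66.7
```

Under these assumptions, a 10-minute interaction corresponds to at least 9000 consecutive predictions, and sustaining 15 FPS leaves roughly 66.7 ms per frame for dynamics prediction and decoding combined if they run sequentially.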