Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, because they lack a structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from a single image, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480×832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities for applying video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/