We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Prior works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements from updating the underlying physics for subsequent interactions. PerpetualWonder addresses this with the first truly closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and the visual primitives, allowing generative refinements to correct both dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that, from a single image, PerpetualWonder successfully simulates complex, multi-step interactions driven by long-horizon actions while maintaining physical plausibility and visual consistency.