Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, which allows it to predict physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of this physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in adherence to physical dynamics but also delivers competitive perceptual fidelity.
