Current video generation models produce aesthetically high-quality videos but often struggle to learn representations of real-world physical dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses, directly from noisy diffusion latents, the high-level physical features extracted by a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2). These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that the multi-task optimization remains stable throughout training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
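To make the described pipeline concrete, the following is a minimal PyTorch sketch of the three ingredients named above: a PredictorP-style head that regresses physics features from noisy diffusion latents, a cross-attention module that injects the resulting physics tokens into temporal hidden states, and a joint loss combining the diffusion objective with the feature-regression objective. All module internals, tensor shapes, and the loss weight `lambda_phys` are illustrative assumptions, not the exact PhysVideoGenerator implementation.

```python
# Illustrative sketch only; shapes, layer choices, and weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictorP(nn.Module):
    """Lightweight head that regresses V-JEPA 2-style physics tokens
    from (pooled) noisy diffusion latents. Dimensions are hypothetical."""

    def __init__(self, latent_dim: int = 512, physics_dim: int = 1024, n_tokens: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(latent_dim),
            nn.Linear(latent_dim, physics_dim),
            nn.GELU(),
            nn.Linear(physics_dim, physics_dim),
        )
        self.n_tokens = n_tokens

    def forward(self, noisy_latents: torch.Tensor) -> torch.Tensor:
        # noisy_latents: (B, T, latent_dim) per-frame latents.
        # Returns (B, n_tokens, physics_dim): frames are pooled into a
        # fixed number of physics tokens for illustration.
        feats = self.net(noisy_latents)                       # (B, T, physics_dim)
        tokens = F.adaptive_avg_pool1d(
            feats.transpose(1, 2), self.n_tokens              # pool over time
        ).transpose(1, 2)
        return tokens


class PhysicsCrossAttention(nn.Module):
    """Cross-attention that injects predicted physics tokens into the
    hidden states of a temporal attention block (DiT-style)."""

    def __init__(self, hidden_dim: int = 512, physics_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True,
                                          kdim=physics_dim, vdim=physics_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, hidden: torch.Tensor, physics_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, hidden_dim) temporal tokens; physics_tokens: (B, K, physics_dim).
        attended, _ = self.attn(self.norm(hidden), physics_tokens, physics_tokens)
        return hidden + attended                              # residual injection


def joint_loss(pred_noise, true_noise, pred_physics, target_physics, lambda_phys: float = 0.1):
    """One joint-training objective: denoising loss plus a regression loss
    aligning PredictorP output with frozen V-JEPA 2 features (teacher detached)."""
    diffusion_loss = F.mse_loss(pred_noise, true_noise)
    physics_loss = F.mse_loss(pred_physics, target_physics.detach())
    return diffusion_loss + lambda_phys * physics_loss


if __name__ == "__main__":
    B, T = 2, 8
    latents = torch.randn(B, T, 512)
    hidden = torch.randn(B, T, 512)
    target_phys = torch.randn(B, 16, 1024)       # stand-in for V-JEPA 2 features

    predictor = PredictorP()
    injector = PhysicsCrossAttention()

    phys_tokens = predictor(latents)
    fused = injector(hidden, phys_tokens)
    loss = joint_loss(torch.randn(B, T, 512), torch.randn(B, T, 512),
                      phys_tokens, target_phys)
    print(fused.shape, loss.item())
```

The residual cross-attention form is one plausible way to realize the "injection into temporal attention layers" described above; the actual integration point inside the Latte blocks may differ.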