Video generation models (VGMs) have received extensive attention recently and serve as promising candidates for general-purpose large vision models. Since a VGM can generate only a short clip per call, existing methods achieve long video generation by invoking the VGM iteratively, using the last frame of each output as the condition for the next round of generation. However, the last frame contains only short-term, fine-grained information about the scene, resulting in inconsistency over long horizons. To address this, we propose an Omni World modeL (Owl-1) that produces long-term coherent and comprehensive conditions for consistent long video generation. As videos are observations of an underlying evolving world, we propose to model the long-term developments of that world in a latent space and use VGMs to film them into videos. Specifically, we represent the world with a latent state variable that can be decoded into explicit video observations. These observations serve as a basis for anticipating temporal dynamics, which in turn update the state variable. The interaction between evolving dynamics and the persistent state enhances both the diversity and the consistency of the generated long videos. Extensive experiments show that Owl-1 achieves performance comparable to SOTA methods on VBench-I2V and VBench-Long, validating its ability to generate high-quality video observations. Code: https://github.com/huang-yh/Owl.
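The abstract's generation loop (latent state → decoded video observation → anticipated dynamics → updated state) can be sketched as pseudocode. This is a minimal illustrative sketch only: every class and function name here (`WorldState`, `decode_to_clip`, `predict_dynamics`) is a hypothetical placeholder, not the authors' actual API, and the numeric operations are dummies standing in for the VGM and the dynamics model.

```python
# Hedged sketch of the state/observation/dynamics loop described above.
# All names are hypothetical placeholders; the real Owl-1 components are
# neural networks, not the toy arithmetic used here.

from dataclasses import dataclass
from typing import List


@dataclass
class WorldState:
    """Latent state variable summarizing the evolving world (placeholder)."""
    latent: List[float]


def decode_to_clip(state: WorldState) -> List[float]:
    """Stand-in for the VGM that 'films' the latent state into a short clip."""
    return list(state.latent)  # dummy observation


def predict_dynamics(state: WorldState, clip: List[float]) -> List[float]:
    """Stand-in for anticipating temporal dynamics from the observation."""
    return [s + 0.1 * c for s, c in zip(state.latent, clip)]


def generate_long_video(init: WorldState, num_clips: int) -> List[List[float]]:
    """Iterate the loop: each clip is conditioned on the persistent state,
    not merely on the last frame of the previous clip."""
    clips: List[List[float]] = []
    state = init
    for _ in range(num_clips):
        clip = decode_to_clip(state)               # state -> explicit observation
        dynamics = predict_dynamics(state, clip)   # observation -> dynamics
        state = WorldState(latent=dynamics)        # dynamics update the state
        clips.append(clip)
    return clips


clips = generate_long_video(WorldState(latent=[1.0, 2.0]), num_clips=3)
print(len(clips))  # 3
```

The key design point the sketch illustrates is that the condition carried across rounds is the full latent state, which accumulates long-term context, rather than only the most recent frame.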