Video generation models (VGMs) have received extensive attention recently and serve as promising candidates for general-purpose large vision models. Since a VGM can generate only a short clip per call, existing methods achieve long video generation by invoking the VGM iteratively, using the last frame of each output as the condition for the next round of generation. However, the last frame contains only short-term, fine-grained information about the scene, resulting in inconsistency over long horizons. To address this, we propose an Omni World modeL (Owl-1) that produces long-term coherent and comprehensive conditions for consistent long video generation. As videos are observations of an underlying evolving world, we propose to model the long-term developments of that world in a latent space and use VGMs to film them into videos. Specifically, we represent the world with a latent state variable that can be decoded into explicit video observations. These observations serve as a basis for anticipating temporal dynamics, which in turn update the state variable. The interaction between evolving dynamics and the persistent state enhances both the diversity and the consistency of the generated long videos. Extensive experiments show that Owl-1 achieves performance comparable to SOTA methods on VBench-I2V and VBench-Long, validating its ability to generate high-quality video observations. Code: https://github.com/huang-yh/Owl.
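The abstract's generation loop (latent state → decoded video observation → anticipated dynamics → updated state) can be sketched as pseudocode. This is a minimal illustrative sketch only: every class and function name here (`WorldState`, `decode_to_clip`, `predict_dynamics`) is a hypothetical placeholder, not the authors' actual API, and the numeric operations are dummies standing in for the VGM and the dynamics model.

```python
# Hedged sketch of the state/observation/dynamics loop described above.
# All names are hypothetical placeholders; the real Owl-1 components are
# neural networks, not the toy arithmetic used here.

from dataclasses import dataclass
from typing import List


@dataclass
class WorldState:
    """Latent state variable summarizing the evolving world (placeholder)."""
    latent: List[float]


def decode_to_clip(state: WorldState) -> List[float]:
    """Stand-in for the VGM that 'films' the latent state into a short clip."""
    return list(state.latent)  # dummy observation


def predict_dynamics(state: WorldState, clip: List[float]) -> List[float]:
    """Stand-in for anticipating temporal dynamics from the observation."""
    return [s + 0.1 * c for s, c in zip(state.latent, clip)]


def generate_long_video(init: WorldState, num_clips: int) -> List[List[float]]:
    """Iterate the loop: each clip is conditioned on the persistent state,
    not merely on the last frame of the previous clip."""
    clips: List[List[float]] = []
    state = init
    for _ in range(num_clips):
        clip = decode_to_clip(state)               # state -> explicit observation
        dynamics = predict_dynamics(state, clip)   # observation -> dynamics
        state = WorldState(latent=dynamics)        # dynamics update the state
        clips.append(clip)
    return clips


clips = generate_long_video(WorldState(latent=[1.0, 2.0]), num_clips=3)
print(len(clips))  # 3
```

The key design point the sketch illustrates is that the condition carried across rounds is the full latent state, which accumulates long-term context, rather than only the most recent frame.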