Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with the fast storage of episodic memory from new experiences. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performance on long-horizon planning tasks as well. Project Website: https://slowfast-vgen.github.io
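To make the fast-learning idea concrete, the following is a minimal sketch (not the paper's implementation, and using toy dimensions and a plain SGD step as assumptions) of the core mechanism: the pretrained weights stay frozen, and "episodic memory" is stored at inference time by updating only a low-rank, LoRA-style delta from local input/output pairs.

```python
# Minimal sketch of inference-time fast learning with a LoRA-style
# low-rank update. All names, dimensions, and the SGD rule are
# illustrative assumptions, not the paper's actual implementation.
import random

random.seed(0)
DIM, RANK, LR = 4, 1, 0.05

# Frozen "pretrained" weight W (identity here, for illustration).
W = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
# Trainable low-rank factors: delta = B @ A, with B = 0 so delta starts at 0.
A = [[random.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(RANK)]
B = [[0.0 for _ in range(RANK)] for _ in range(DIM)]

def forward(x):
    # y = (W + B A) x : frozen slow weights plus the low-rank fast delta.
    ax = [sum(A[r][j] * x[j] for j in range(DIM)) for r in range(RANK)]
    return [sum(W[i][j] * x[j] for j in range(DIM))
            + sum(B[i][r] * ax[r] for r in range(RANK))
            for i in range(DIM)]

def fast_update(x, y_target):
    # One SGD step on squared error, touching ONLY A and B (W is frozen).
    ax = [sum(A[r][j] * x[j] for j in range(DIM)) for r in range(RANK)]
    err = [yi - ti for yi, ti in zip(forward(x), y_target)]
    # Compute both gradients before applying either update.
    grad_B = [[err[i] * ax[r] for r in range(RANK)] for i in range(DIM)]
    grad_A = [[sum(err[i] * B[i][r] for i in range(DIM)) * x[j]
               for j in range(DIM)] for r in range(RANK)]
    for i in range(DIM):
        for r in range(RANK):
            B[i][r] -= LR * grad_B[i][r]
    for r in range(RANK):
        for j in range(DIM):
            A[r][j] -= LR * grad_A[r][j]

# One "episode": a local input/output pair the fast weights should memorize.
x = [1.0, 0.0, 0.0, 0.0]
y_target = [0.0, 1.0, 0.0, 0.0]
for _ in range(200):
    fast_update(x, y_target)
loss = sum((a - b) ** 2 for a, b in zip(forward(x), y_target))
print(f"memorization loss after fast learning: {loss:.6f}")
```

After the fast-learning steps, the frozen base mapping still governs general behavior, while the tiny low-rank delta has absorbed the episode-specific mapping; resetting B to zero would discard that episodic memory without touching the slow weights.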