SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with the fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase that is crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, because these frames fall beyond the model's context window. To address this, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach combines a masked conditional video diffusion model for the slow learning of world dynamics with an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates the temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in those parameters. We further propose a slow-fast learning loop algorithm that integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782 (lower is better), and maintains consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm also significantly improves performance on long-horizon planning tasks. Project Website: https://slowfast-vgen.github.io
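The fast learning idea described above (a low-rank adapter updated at inference time on local inputs and outputs, while the pre-trained weights stay fixed) can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the SlowFast-VGen implementation: it uses a single linear layer in place of the video diffusion model, a plain squared-error loss, and full-batch gradient descent. The names `LoRALinear`, `fast_update`, `matvec`, and all numeric values are hypothetical.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Toy layer y = W x + B (A x): W is frozen, only A and B are updated."""

    def __init__(self, W, rank, seed=0):
        rng = random.Random(seed)
        self.W = [row[:] for row in W]  # "slow" weights, never changed here
        d_out, d_in = len(W), len(W[0])
        self.A = [[rng.gauss(0.0, 0.5) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]

    def loss(self, pairs):
        """Mean over pairs of 0.5 * squared error."""
        total = 0.0
        for x, t in pairs:
            total += 0.5 * sum((y - ti) ** 2 for y, ti in zip(self.forward(x), t))
        return total / len(pairs)

    def fast_update(self, pairs, lr=0.05, steps=200):
        """Gradient descent on A and B only, using the local (input, output) pairs."""
        n = len(pairs)
        rank, d_in, d_out = len(self.A), len(self.A[0]), len(self.B)
        for _ in range(steps):
            gA = [[0.0] * d_in for _ in range(rank)]
            gB = [[0.0] * rank for _ in range(d_out)]
            for x, t in pairs:
                h = matvec(self.A, x)
                e = [y - ti for y, ti in zip(self.forward(x), t)]
                for i in range(d_out):
                    for k in range(rank):
                        gB[i][k] += e[i] * h[k] / n  # dL/dB = e h^T
                for k in range(rank):
                    bte = sum(self.B[i][k] * e[i] for i in range(d_out))
                    for j in range(d_in):
                        gA[k][j] += bte * x[j] / n  # dL/dA = (B^T e) x^T
            for k in range(rank):
                for j in range(d_in):
                    self.A[k][j] -= lr * gA[k][j]
            for i in range(d_out):
                for k in range(rank):
                    self.B[i][k] -= lr * gB[i][k]


# Made-up "episode": targets differ from the frozen layer by a small shift D.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
D = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.2]]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
pairs = [(x, [a + b for a, b in zip(matvec(W, x), matvec(D, x))]) for x in xs]

layer = LoRALinear(W, rank=2)
before = layer.loss(pairs)
layer.fast_update(pairs)
after = layer.loss(pairs)
print("loss before:", before, "loss after:", after)
```

In this sketch the episode-specific information ends up only in `A` and `B`, which loosely mirrors the abstract's description of storing episodic memory in the LoRA parameters; how the actual model selects its local inputs and outputs and its loss is not specified in this text.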
