Recent advances in video generation enable realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method extends the full attention mechanism from individual shots to encompass all shots within a scene, incorporating an interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can be further fine-tuned with context-causal attention, facilitating auto-regressive generation with an efficient KV-cache. Experiments demonstrate that single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
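To make the context-causal attention pattern concrete, the sketch below builds the kind of attention mask the abstract describes: tokens attend bidirectionally within their own shot, and causally to all tokens of earlier shots in the scene. This is a minimal illustration of the masking structure only, not the paper's actual implementation; the function name and NumPy-based representation are our own assumptions.

```python
import numpy as np

def context_causal_mask(shot_lengths):
    """Illustrative sketch (not the paper's implementation): build a boolean
    attention mask where each token attends bidirectionally to tokens in its
    own shot and causally to all tokens of earlier shots in the scene.

    mask[q, k] == True means query token q may attend to key token k.
    """
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in shot_lengths:
        end = start + n
        # Full bidirectional attention within the current shot.
        mask[start:end, start:end] = True
        # Context-causal attention: the current shot also sees all
        # tokens of every preceding shot, but none of the later ones.
        mask[start:end, :start] = True
        start = end
    return mask

# Example: a toy scene of three shots with 2, 3, and 2 tokens each.
m = context_causal_mask([2, 3, 2])
```

Because later shots never attend to earlier-shot queries' futures, the key/value activations of completed shots are fixed and can be cached, which is what makes auto-regressive shot extension with a KV-cache efficient.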