CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu

Source: arXiv. Project page: https://yihao-meng.github.io/CausalCine/

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
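The abstract describes CAMR only at a high level, so the following is a minimal illustrative sketch of the general idea of selecting cached entries by attention-based relevance instead of recency, not the authors' implementation. The function names (`relevance_scores`, `route_by_content`, `route_by_recency`), the per-entry top-k selection rule, and the toy two-dimensional keys are all assumptions made for illustration.

```python
import math

def relevance_scores(query, keys):
    """Softmax of scaled dot products between one query and each cached key."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def route_by_content(query, keys, budget):
    """Indices of the `budget` most relevant cache entries, in temporal order."""
    scores = relevance_scores(query, keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def route_by_recency(num_entries, budget):
    """Baseline: keep only the most recent `budget` entries."""
    return list(range(max(0, num_entries - budget), num_entries))

# Hypothetical toy cache: entries 0-1 come from an earlier shot of one subject,
# entries 2-5 from an unrelated intermediate shot.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0], [0.1, 1.0]]
query = [1.0, 0.0]  # the current chunk returns to the earlier subject

content_pick = route_by_content(query, keys, budget=2)
recency_pick = route_by_recency(len(keys), budget=2)
print(content_pick, recency_pick)
```

In this toy setup the content-based selection keeps the two entries from the earlier shot, while the recency window keeps only the latest entries from the unrelated shot. Both use the same fixed budget, which is the sense in which active memory stays bounded.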

