Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.
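
The dual-query memory mechanism described above can be pictured with a toy sketch. This is not Memento's implementation: the abstract does not say how memory entries are encoded or scored, so the cosine-similarity top-k retrieval, the function names, and the parameters `top_k`, `window`, and `n_keyframes` are all assumptions made for illustration. The sketch only shows the separation of concerns: one query searches the full history for identity evidence, while the other looks only at recent entries to pick keyframes for continuation.

```python
import numpy as np

def cosine_scores(query, keys):
    """Cosine similarity between one query vector and each row of keys."""
    q = query / (np.linalg.norm(query) + 1e-8)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    return k @ q

def dual_query_retrieve(memory, identity_query, context_query,
                        top_k=2, window=3, n_keyframes=1):
    """Hypothetical dual-query lookup over a memory bank.

    memory: (N, D) array of per-frame embeddings, oldest first.
    Returns (identity_idx, keyframe_idx), both indexing into memory.
    """
    n = memory.shape[0]

    # Identity query: search the entire history (long-range evidence).
    id_scores = cosine_scores(identity_query, memory)
    identity_idx = np.argsort(-id_scores)[:top_k]

    # Context query: restrict to the most recent `window` entries
    # (short-range cues), then map local indices back to global ones.
    start = max(0, n - window)
    ctx_scores = cosine_scores(context_query, memory[start:])
    keyframe_idx = start + np.argsort(-ctx_scores)[:n_keyframes]

    return identity_idx, keyframe_idx
```

In this toy setup, an old memory entry that matches the subject can still be retrieved by the identity query even when it falls outside the short-context window, which is the kind of disentanglement the abstract describes.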