Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
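To make the retrieve-update mechanism concrete, the sketch below shows one way a Dynamic Memory Bank of per-entity descriptors could be queried before each shot and updated afterward. This is an illustrative assumption only: the class and method names (`DynamicMemoryBank`, `EntityState`, `retrieve`, `update`) and the per-shot loop are hypothetical and not taken from the VideoMemory implementation.

```python
# Illustrative sketch: a minimal per-entity memory with a retrieve step before
# each shot and an update step after it. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class EntityState:
    """Visual and semantic descriptors for one character, prop, or background."""
    name: str
    visual_descriptor: str      # e.g., appearance summary or reference-image path
    semantic_descriptor: str    # e.g., role or current state in the story
    history: list = field(default_factory=list)


class DynamicMemoryBank:
    def __init__(self):
        self.entities: dict[str, EntityState] = {}

    def retrieve(self, entity_names):
        """Return the latest stored state for each entity appearing in a shot."""
        return {n: self.entities[n] for n in entity_names if n in self.entities}

    def update(self, name, visual=None, semantic=None):
        """Apply story-driven changes after a shot while preserving identity."""
        state = self.entities.setdefault(
            name, EntityState(name, visual or "", semantic or ""))
        state.history.append((state.visual_descriptor, state.semantic_descriptor))
        if visual is not None:
            state.visual_descriptor = visual
        if semantic is not None:
            state.semantic_descriptor = semantic


# Per-shot loop: retrieve entity states, condition generation on them, update memory.
bank = DynamicMemoryBank()
bank.update("hero", visual="red scarf, short dark hair", semantic="sets out at dawn")
for shot in [{"entities": ["hero"], "action": "crosses the bridge"}]:
    states = bank.retrieve(shot["entities"])
    # ... synthesize the keyframe and video conditioned on `states` and the shot script ...
    bank.update("hero", semantic="reaches the far bank")  # reflect the story-driven change
```

Under this reading, identity is preserved because visual descriptors persist in memory across shots, while semantic descriptors are revised shot by shot to track the narrative.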