VideoMemory: Toward Consistent Video Generation via Memory Integration

Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
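
The retrieval-update loop described above can be illustrated with a small sketch. This is not the authors' implementation: the names `EntityState`, `MemoryBank`, and `run_script`, the split into a fixed `identity` field and a mutable `appearance` field, and the use of a text prompt as a stand-in for keyframe and video synthesis are all assumptions made for illustration. The abstract only states that the memory holds visual and semantic descriptors and is updated after each shot.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityState:
    """Hypothetical memory entry for one character, prop, or background."""
    kind: str          # "character", "prop", or "background"
    identity: str      # stable descriptor; never modified by updates
    appearance: str    # story-dependent descriptor; may change per shot
    used_in: List[int] = field(default_factory=list)  # shot indices


class MemoryBank:
    """Toy stand-in for a dynamic memory bank keyed by entity name."""

    def __init__(self) -> None:
        self._entries: Dict[str, EntityState] = {}

    def register(self, name: str, kind: str, identity: str, appearance: str) -> None:
        self._entries[name] = EntityState(kind, identity, appearance)

    def retrieve(self, names: List[str], shot_index: int) -> Dict[str, EntityState]:
        # Look up the current state of every entity appearing in the shot.
        states = {}
        for name in names:
            state = self._entries[name]  # KeyError if the entity is unknown
            state.used_in.append(shot_index)
            states[name] = state
        return states

    def update(self, name: str, new_appearance: str) -> None:
        # Only the appearance changes; the identity descriptor is preserved.
        self._entries[name].appearance = new_appearance


def run_script(bank: MemoryBank, shots: List[dict]) -> List[str]:
    """Retrieve, 'generate' (here: build a prompt string), then update."""
    prompts = []
    for index, shot in enumerate(shots):
        states = bank.retrieve(shot["entities"], index)
        conditioning = "; ".join(
            f"{name}: {s.identity}, {s.appearance}" for name, s in states.items()
        )
        # Placeholder for keyframe and video synthesis conditioned on memory.
        prompts.append(f"{shot['action']} | {conditioning}")
        for name, change in shot.get("changes", {}).items():
            bank.update(name, change)
    return prompts
```

In this sketch, an entity that changes in one shot and reappears several shots later is retrieved with its updated appearance and its original identity descriptor, which is the behavior the abstract attributes to the retrieval-update mechanism.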

