Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
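To make the memory design concrete, the following is a minimal sketch of two context banks with an edit-aware update and a retrieval step. It is an illustration under stated assumptions, not PermaVid's actual implementation: the class and method names (`ContextMemory`, `DualMemory`, `apply_edit`, `references`), the camera-position key, the radius-based invalidation rule, and the nearest-neighbor retrieval are all hypothetical, and string payloads stand in for real RGB and depth features.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float]  # hypothetical key: camera position only


@dataclass
class Entry:
    pose: Pose
    frame_id: int
    payload: str        # stand-in for an RGB or depth feature
    valid: bool = True  # False once an edit has made the entry stale


class ContextMemory:
    """One memory bank (assumed design): stores entries keyed by camera pose."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def write(self, pose: Pose, frame_id: int, payload: str) -> None:
        self.entries.append(Entry(pose, frame_id, payload))

    def invalidate_near(self, center: Pose, radius: float) -> int:
        """Mark entries within `radius` of `center` as stale; return the count."""
        count = 0
        for entry in self.entries:
            if entry.valid and math.dist(entry.pose, center) <= radius:
                entry.valid = False
                count += 1
        return count

    def retrieve(self, pose: Pose, k: int) -> List[Entry]:
        """Return up to k valid entries, nearest to the query pose first."""
        valid = [e for e in self.entries if e.valid]
        return sorted(valid, key=lambda e: math.dist(e.pose, pose))[:k]


class DualMemory:
    """RGB (appearance) and depth (geometry) banks updated side by side."""

    def __init__(self) -> None:
        self.rgb = ContextMemory()
        self.depth = ContextMemory()

    def observe(self, pose: Pose, frame_id: int, rgb: str, depth: str) -> None:
        self.rgb.write(pose, frame_id, rgb)
        self.depth.write(pose, frame_id, depth)

    def apply_edit(self, center: Pose, radius: float, changes_layout: bool) -> None:
        # Assumption: any edit makes nearby appearance context stale, while
        # geometry context goes stale only when the edit changes the layout.
        self.rgb.invalidate_near(center, radius)
        if changes_layout:
            self.depth.invalidate_near(center, radius)

    def references(self, pose: Pose, k: int) -> Dict[str, List[Entry]]:
        """Mixed-modality reference set for conditioning a generator."""
        return {"rgb": self.rgb.retrieve(pose, k),
                "depth": self.depth.retrieve(pose, k)}
```

In this sketch, an appearance-only edit invalidates nearby RGB entries but leaves the depth entries usable, which illustrates one reason a geometry-only bank can help preserve structural consistency after edits. A real system would key entries on full camera poses or field-of-view overlap and would store learned features rather than strings.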