Story visualization aims to generate a sequence of images that faithfully depicts a textual narrative while preserving character identity, spatial configuration, and stylistic coherence as the narrative unfolds. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors: it activates only when a character is referred to by a pronoun and conditions on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and introduces only 149K additional parameters, a fraction of the cost of memory-bank and LLM-augmented approaches. To further stabilize identity, we incorporate SemDrift (Guided Semantic Drift Correction), applied only during training. When text is vague or referential, the denoiser lacks a visual anchor for identity-defining attributes, causing character appearance to drift across frames. SemDrift corrects this by aligning denoiser representations with pretrained DINOv3 visual embeddings, enforcing semantic identity stability at zero inference cost. ReCap outperforms the previous state of the art, StoryGPT-V, on the two main story-visualization benchmarks, improving Character Accuracy by 2.63% on FlintstonesSV and by 5.65% on PororoSV, establishing new state-of-the-art character consistency on both. Furthermore, we extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.
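The selective conditioning idea behind CORE can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the pronoun list, the `CoreSketch` class, and the linear projection are all hypothetical, and the module simply appends projected previous-frame tokens to the text conditioning only when the caption contains an anaphor.

```python
# Hedged sketch of CORE-style selective cross-frame conditioning.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

# Assumed pronoun inventory acting as the anaphor detector.
PRONOUNS = {"he", "she", "they", "him", "her", "them", "his", "hers", "their"}

def has_pronoun(caption: str) -> bool:
    """Gate: activate cross-frame referencing only for anaphoric captions."""
    return any(tok.strip(".,!?").lower() in PRONOUNS for tok in caption.split())

class CoreSketch(nn.Module):
    """Projects previous-frame features into the text-conditioning space.

    A single small projection keeps the added parameter count low,
    in the spirit of the lightweight design described in the abstract.
    """
    def __init__(self, frame_dim: int = 768, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(frame_dim, cond_dim, bias=False)  # lightweight adapter

    def forward(self, text_cond, prev_frame_feat, caption):
        # No anaphor (or no previous frame): generate from text alone.
        if prev_frame_feat is None or not has_pronoun(caption):
            return text_cond
        ref = self.proj(prev_frame_feat)           # (B, N_ref, cond_dim)
        return torch.cat([text_cond, ref], dim=1)  # append reference tokens
```

Because the gate fires only on referential captions, frames with explicit character mentions are generated without any cross-frame dependency, which is what distinguishes this design from unconditional frame-to-frame conditioning.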
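The SemDrift objective described above can be illustrated with a simple training-only alignment loss. This is a hedged sketch, not the paper's exact formulation: the trainable projection head and the cosine-distance loss are assumptions, and the only property taken from the abstract is that frozen DINOv3 embeddings supervise the denoiser's features during training and contribute nothing at inference.

```python
# Hedged sketch of a SemDrift-style loss: align pooled denoiser features
# with frozen DINOv3 embeddings of the target frame. The head and the
# cosine-distance objective are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemDriftLoss(nn.Module):
    def __init__(self, denoiser_dim: int = 1280, dino_dim: int = 768):
        super().__init__()
        # Trainable head mapping denoiser features into the DINOv3 space;
        # it is discarded after training, so inference cost is zero.
        self.head = nn.Linear(denoiser_dim, dino_dim)

    def forward(self, denoiser_feat, dino_feat):
        """denoiser_feat: (B, denoiser_dim) pooled denoiser activations.
        dino_feat: (B, dino_dim) DINOv3 embedding of the target frame."""
        pred = F.normalize(self.head(denoiser_feat), dim=-1)
        target = F.normalize(dino_feat.detach(), dim=-1)  # DINOv3 stays frozen
        return (1.0 - (pred * target).sum(dim=-1)).mean()  # mean cosine distance
```

Detaching the DINOv3 embedding keeps the visual encoder frozen, so gradients flow only into the denoiser and the auxiliary head; dropping the head at inference time is what makes the correction free at generation.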