Diffusion models built on powerful text-to-image generators such as Stable Diffusion have achieved remarkable success in visual story generation. However, the best-performing approach treats previously generated results as flattened memory cells, ignoring the fact that not all preceding images contribute equally to generating the characters and scenes at the current stage. To address this, we present a simple method that improves the leading system with adaptive context modeling, which is not only incorporated in the encoder but also adopted as additional guidance in the sampling stage to boost the global consistency of the generated story. We evaluate our model on the PororoSV and FlintstonesSV datasets and show that our approach achieves state-of-the-art FID scores in both the story visualization and story continuation scenarios. We conduct a detailed model analysis and show that our model excels at generating semantically consistent images for stories.
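The contrast between a flattened memory and an adaptive context can be illustrated with a toy sketch. It is not the implementation described in the abstract: the function names (`flat_context`, `adaptive_context`), the dot-product relevance score, the `temperature` parameter, and the two-dimensional toy embeddings are all assumptions made for illustration. The sketch only shows the general idea that preceding frames can be weighted by their relevance to the current step rather than averaged uniformly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_context(history):
    """Flattened memory: every previous frame contributes equally."""
    n = len(history)
    dim = len(history[0])
    return [sum(h[i] for h in history) / n for i in range(dim)]


def adaptive_context(query, history, temperature=1.0):
    """Hypothetical adaptive context: weight each previous frame by its
    similarity to the current step's embedding (an assumed scoring rule)."""
    scores = [dot(query, h) / temperature for h in history]
    weights = softmax(scores)
    dim = len(history[0])
    context = [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]
    return context, weights


# Toy embeddings (made up): frames 0 and 2 resemble the current step, frame 1 does not.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]

context, weights = adaptive_context(query, history)
uniform = flat_context(history)
```

In this toy setting the adaptive weights favor the frames most similar to the current step, so the resulting context vector sits closer to the query than the uniform average does. How relevance is actually scored in the encoder, and how it enters the sampling-stage guidance, is not specified in the abstract.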