This paper introduces Story-Iter, a training-free iterative paradigm for long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm that extends beyond the internal iterative denoising steps of diffusion models, continuously refining each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free Global Reference Cross-Attention (GRCA) module that models all reference frames with global embeddings, ensuring semantic consistency across long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step by step. Extensive experiments on the official story-visualization dataset and our long-story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.
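The core mechanism can be sketched in miniature: in each external round, every frame is summarized into a global embedding, and each frame's tokens then cross-attend to the full set of these reference embeddings. The code below is an illustrative NumPy sketch under our own assumptions (toy dimensions, mean-pooled global embeddings, a hypothetical residual mixing weight); it is not the paper's actual diffusion-model implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_reference_cross_attention(frame_tokens, ref_embeddings):
    """Frame tokens (queries) attend to the global embeddings of ALL
    reference frames (keys/values), giving each token holistic context.
    frame_tokens: (T, d); ref_embeddings: (R, d) -> returns (T, d)."""
    d_k = frame_tokens.shape[-1]
    scores = frame_tokens @ ref_embeddings.T / np.sqrt(d_k)  # (T, R)
    weights = softmax(scores, axis=-1)                       # attention over refs
    return weights @ ref_embeddings                          # (T, d)

# Toy external iteration: each round rebuilds global reference embeddings
# from the previous round's frames, then refines every frame with them.
rng = np.random.default_rng(0)
d, n_frames, tokens_per_frame = 8, 4, 5
frames = [rng.normal(size=(tokens_per_frame, d)) for _ in range(n_frames)]
for round_idx in range(2):  # external rounds, outside the denoising loop
    refs = np.stack([f.mean(axis=0) for f in frames])  # one global embedding per frame (assumption: mean pooling)
    frames = [f + 0.5 * global_reference_cross_attention(f, refs)  # 0.5 is a hypothetical mixing weight
              for f in frames]
```

The sketch highlights the two properties the abstract names: the attention is plug-and-play (no learned parameters here), and each frame is conditioned on all reference frames rather than a single fixed one.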