Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key difficulty is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past text-image pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose \textbf{ViSTA}, a multi-modal history adapter for text-to-image diffusion models. It consists of (1) a multi-modal history fusion module that extracts relevant history features and (2) a history adapter that conditions generation on those features. We also introduce a salient history selection strategy at inference time, which selects the most salient history text-image pair and thereby improves the quality of the conditioning. Furthermore, we propose to employ TIFA, a Visual Question Answering-based metric, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of the generated images. Evaluated on the StorySalon and FlintStonesSV datasets, ViSTA generates image sequences that are not only consistent across frames but also well aligned with the narrative text descriptions.
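For concreteness, the following is a minimal sketch of how the salient history selection step could be instantiated. It assumes saliency is measured by CLIP text-embedding similarity between the current prompt and each history caption; the paper's actual saliency criterion may differ, and the names `select_salient_history` and `history` are hypothetical.

```python
# Hypothetical sketch of salient history selection, assuming saliency is
# scored by CLIP text-embedding similarity between the current prompt and
# each history caption. Names and the scoring rule are illustrative only.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_salient_history(current_prompt, history):
    """Return the history (caption, image) pair most similar to the prompt."""
    captions = [caption for caption, _ in history]
    inputs = processor(text=[current_prompt] + captions,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    scores = feats[1:] @ feats[0]  # cosine similarity: prompt vs. each caption
    return history[int(scores.argmax())]
```

Under this assumption, the selected pair would then be passed to the history fusion module as the conditioning input for the current frame.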