Story visualization is a challenging task in text-to-image generation, requiring not only the rendering of visual details from the text prompt but also consistency across images. Recently, most approaches address the inconsistency problem in an auto-regressive manner, conditioning on previous image-sentence pairs. However, they overlook the fact that story context is dispersed across all sentences. The auto-regressive approach cannot encode information from subsequent image-sentence pairs and is thus unable to capture the entirety of the story context. To address this, we introduce TemporalStory, which leverages Spatial-Temporal attention to model complex spatial and temporal dependencies across images, enabling the generation of coherent images from a given storyline. To better understand the storyline context, we introduce a text adapter that integrates information from other sentences into the embedding of the current sentence. Additionally, to exploit scene changes between story images as guidance for the model, we propose the StoryFlow Adapter, which measures the degree of change between images. In extensive experiments on two popular benchmarks, PororoSV and FlintstonesSV, TemporalStory outperforms the previous state of the art on both the story visualization and story continuation tasks.
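To illustrate the idea behind Spatial-Temporal attention, the following is a minimal NumPy sketch, not the authors' implementation: spatial attention mixes tokens within each frame, and temporal attention mixes the same spatial position across frames. All function names and the feature layout `(T, N, C)` (T frames, N spatial tokens, C channels) are illustrative assumptions; the actual model operates inside a diffusion backbone with learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the token axis (second to last)
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def spatial_temporal_attention(feats):
    """feats: (T, N, C) array of per-frame image features.

    Spatial attention attends over the N tokens of each frame;
    temporal attention then attends over the T frames at each
    spatial position, propagating story context between images.
    (Illustrative self-attention without learned Q/K/V projections.)
    """
    spatial = attention(feats, feats, feats)            # attends over N
    t = spatial.swapaxes(0, 1)                          # (N, T, C)
    temporal = attention(t, t, t).swapaxes(0, 1)        # attends over T
    return temporal                                     # (T, N, C)

# Usage: 4 story frames, 16 spatial tokens each, 8 channels
out = spatial_temporal_attention(np.random.rand(4, 16, 8))
print(out.shape)  # (4, 16, 8)
```

The key design point this sketch captures is that temporal attention runs over all frames at once, so each image can draw context from both earlier and later image-sentence pairs, unlike the auto-regressive conditioning discussed above.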