Recent advances in large-scale text-to-image models have opened new possibilities for guiding image creation through human-written natural language. However, prior literature has focused primarily on generating individual images, whereas real-world applications such as storytelling require these models to maintain coherency across a sequence of images. To address this, we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we combine a pre-trained Large Language Model with a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain coherency, we employ simple textual inversion techniques together with detector-based semantic image editing, which allows zero-shot generation of a coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.