Generative models have recently exhibited exceptional capabilities in various scenarios, for example, image generation from text descriptions. In this work, we focus on the task of generating a coherent image sequence from a given storyline, which we term open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we introduce two modules into a pre-trained Stable Diffusion model and construct an auto-regressive image generator, termed StoryGen, that generates the current frame by conditioning on both a text prompt and a preceding frame; (ii) to train our proposed model, we collect paired image and text samples from various online sources, such as videos and e-books, and establish a data processing pipeline for constructing a diverse dataset, named StorySalon, with a far larger vocabulary than existing animation-specific datasets; (iii) we adopt a three-stage curriculum training strategy that enables style transfer, visual context conditioning, and human feedback alignment, respectively. Quantitative experiments and human evaluation validate the superiority of our proposed model in terms of image quality, style consistency, content consistency, and visual-language alignment. We will make the code, model, and dataset publicly available to the research community.