Story visualization (SV) is a challenging text-to-image generation task: it requires not only rendering visual details from text descriptions but also encoding long-term context across multiple sentences. Prior work mostly focuses on generating a semantically relevant image for each sentence. Encoding the context spread across the whole paragraph, so that the generated images are contextually convincing (e.g., showing the correct character or a proper scene background), remains a challenge. To this end, we propose a novel memory architecture for bi-directional Transformers, combined with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training, for better generalization to language variation at inference. In extensive experiments on two popular SV benchmarks, Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the art on various evaluation metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision, with similar or lower computational complexity.