Recent advances in text-to-image synthesis have brought substantial improvements in quality, fidelity, and diversity, and contemporary techniques can generate highly detailed images that approach photorealism. As the field progresses, however, the methods grow more complex, widening the gap in understanding between those inside the field and those outside it. To narrow this gap, we propose a streamlined approach to text-to-image generation that covers both the training paradigm and the sampling process. Despite its simplicity, our method produces aesthetically pleasing images in few sampling iterations, allows for interesting ways of conditioning the model, and offers advantages absent from state-of-the-art techniques. To demonstrate that this approach achieves results comparable to existing work, we trained a one-billion-parameter text-conditional model, which we call "Paella". To encourage further research in this area, we have made our source code and models publicly available to the research community.