Recent text-to-image diffusion models can generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions or objects, or their layout, in a fine-grained fashion. Previous attempts to provide such control were hindered by their reliance on a fixed set of labels. To address this, we present SpaText, a new method for text-to-image generation with open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map in which each region of interest is annotated with a free-form natural language description. Due to the lack of large-scale datasets that provide a detailed textual description for each region in the image, we leverage existing large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation. We show its effectiveness on two state-of-the-art diffusion models, one pixel-based and one latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we propose several automatic evaluation metrics and use them, together with FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
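The abstract mentions extending classifier-free guidance to several conditions but does not state the formula. As a rough illustration only, the sketch below assumes one common generalization: start from the unconditional noise prediction and add, for each condition, a scaled difference between that condition's prediction and the unconditional one. The function name `multi_conditional_cfg`, its arguments, and the per-condition scales are hypothetical and are not taken from the paper; the noise predictions are stand-in NumPy arrays rather than outputs of a real diffusion model.

```python
import numpy as np


def multi_conditional_cfg(eps_uncond, eps_conds, scales):
    """Combine noise predictions under an assumed multi-conditional guidance rule.

    eps_uncond: noise predicted with every condition dropped.
    eps_conds:  list of noise predictions, each made with one condition active.
    scales:     one guidance scale per condition (hypothetical parameter).

    Returns eps_uncond + sum_i scales[i] * (eps_conds[i] - eps_uncond).
    """
    if len(eps_conds) != len(scales):
        raise ValueError("need exactly one guidance scale per condition")

    # Work in float64 on a copy so the caller's array is not modified.
    guided = np.asarray(eps_uncond, dtype=np.float64).copy()
    for eps_c, scale in zip(eps_conds, scales):
        # Each condition pushes the prediction along its own guidance direction.
        guided += scale * (np.asarray(eps_c, dtype=np.float64) - eps_uncond)
    return guided


if __name__ == "__main__":
    # Toy 2x2 "noise maps" standing in for real model outputs.
    eps_uncond = np.zeros((2, 2))
    eps_global_text = np.ones((2, 2))        # e.g. conditioned on the global prompt
    eps_scene_layout = 2.0 * np.ones((2, 2))  # e.g. conditioned on the segmentation map
    print(multi_conditional_cfg(eps_uncond, [eps_global_text, eps_scene_layout], [2.0, 0.5]))
```

With a single condition this reduces to ordinary classifier-free guidance. Each extra condition requires its own forward pass through the denoiser at every step, which is one plausible motivation for an accelerated variant such as the one the abstract refers to.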