Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task because it allows control over both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient, as they need only a single feed-forward pass for generation, but their image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that exploit feature backbone networks pre-trained for tasks such as image classification, enabling the generation of highly realistic images. We also introduce a new generator architecture that models context better and uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.