We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of generative adversarial networks (GANs) and diffusion models (DMs) by mapping the multi-modal features of the DM into the latent space of pre-trained GANs. We present a simple mapping and a style modulation network to link the two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations in the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images that align well with the inputs. We validate our method using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.