Deep generative models have shown impressive results in generating realistic face images. GANs can generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and can generate diverse samples given the same condition. In this paper, we propose a multi-conditioning approach for diffusion models that uses cross-attention to exploit both attributes and semantic masks, generating high-quality and controllable face images. We also study the impact of applying perceptual-focused loss weighting in the latent space instead of the pixel space. Our method extends previous approaches by conditioning on more than one set of features, enabling finer-grained control over the generated face images. We evaluate our approach on the CelebA-HQ dataset and show that it can generate realistic and diverse samples while allowing fine-grained control over multiple attributes and semantic regions. Additionally, we perform an ablation study to evaluate the impact of different conditioning strategies on the quality and diversity of the generated images.
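To make the idea of multi-conditioning via cross-attention concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes that attribute embeddings and semantic-mask embeddings are already available as token sequences of the same width and that they are simply concatenated into one context sequence; the abstract does not state how the two conditions are combined. The function name, shapes, and random projection matrices are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_condition_cross_attention(latent, attr_tokens, mask_tokens, w_q, w_k, w_v):
    """Single-head cross-attention from latent tokens to two condition sets.

    Assumption: both condition sets are concatenated along the token axis,
    so every latent position can attend to attribute and mask tokens alike.
    """
    context = np.concatenate([attr_tokens, mask_tokens], axis=0)  # (n_attr + n_mask, d_ctx)
    q = latent @ w_q   # queries come from the latent features
    k = context @ w_k  # keys come from the combined conditions
    v = context @ w_v  # values come from the combined conditions
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 16 latent tokens, 3 attribute tokens, 5 mask tokens.
rng = np.random.default_rng(0)
d_latent, d_ctx, d_head = 8, 6, 4
latent = rng.normal(size=(16, d_latent))
attr_tokens = rng.normal(size=(3, d_ctx))
mask_tokens = rng.normal(size=(5, d_ctx))
w_q = rng.normal(size=(d_latent, d_head))
w_k = rng.normal(size=(d_ctx, d_head))
w_v = rng.normal(size=(d_ctx, d_head))

out, weights = multi_condition_cross_attention(
    latent, attr_tokens, mask_tokens, w_q, w_k, w_v
)
print(out.shape, weights.shape)
```

In this sketch the first three columns of `weights` show how strongly each latent position attends to the attribute tokens and the remaining five how strongly it attends to the mask tokens, which is one simple way to inspect how the two conditions share control.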