In human-centric content generation, pre-trained text-to-image models struggle to produce the portrait images users want: images that retain an individual's identity while exhibiting diverse expressions. This paper presents our work towards personalized face generation. We propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and more fine-grained expression synthesis. The expression control is precise enough to be specified through a fine-grained emotional vocabulary. We devise a novel diffusion model that performs face swapping and reenactment simultaneously. Because identity and expression are entangled, controlling them separately and precisely within one framework is nontrivial and has not yet been explored. To overcome this, we propose several innovative designs in the conditional diffusion model, including balancing the identity and expression encoders, improved midpoint sampling, and explicit background conditioning. Extensive experiments demonstrate the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.