Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming per-sample optimization, which is inefficient for modeling content drawn from a single distribution, such as faces. Moreover, an ideal controllable 3D face generation model should account for both facial attributes and expressions. We therefore propose TEx-Face (TExt & Expression-to-Face), a novel approach that addresses these challenges by dividing the task into three components: 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods that enhance the representation of style codes and alleviate 3D inconsistencies. We further design a style code denoiser that incorporates multiple conditions into the style code, and we propose a data augmentation strategy to address the shortage of paired visual-language data. Extensive experiments on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of TEx-Face in efficient and controllable generation of photorealistic 3D faces. The code will be available at https://github.com/sxl142/TEx-Face.