Generating 3D faces from textual descriptions has many applications, such as gaming, film, and robotics. Recent progress has demonstrated the success of unconditional 3D face generation and text-to-3D shape generation. However, because paired text-3D face data is scarce, text-driven 3D face generation remains an open problem. In this paper, we propose a text-guided 3D face generation method, referred to as TG-3DFace, for generating realistic 3D faces using text guidance. Specifically, we adopt an unconditional 3D face generation framework and equip it with text conditions, so that it learns text-guided 3D face generation from text-2D face data alone. On top of that, we propose two text-to-face cross-modal alignment techniques, global contrastive learning and a fine-grained alignment module, to promote high semantic consistency between generated 3D faces and input texts. In addition, we present directional classifier guidance during inference, which encourages creativity in out-of-domain generation. Compared with existing methods, TG-3DFace creates more realistic and aesthetically pleasing 3D faces, improving multi-view consistency (MVIC) by 9% over Latent3D. The rendered face images generated by TG-3DFace achieve better FID and CLIP scores than text-to-2D face/image generation models, demonstrating its superiority in generating realistic and semantically consistent textures.
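The abstract names global contrastive learning as one of the two alignment techniques but does not state the loss. As an illustration only, the sketch below shows a generic symmetric InfoNCE objective of the kind commonly used for text-image alignment; it is an assumption, not the TG-3DFace formulation. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_row(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(text_emb, face_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (text, face) embedding pairs.

    Assumption: text_emb[i] and face_emb[i] describe the same sample, and
    every other pairing in the batch is treated as a negative.
    """
    t = [l2_normalize(v) for v in text_emb]
    f = [l2_normalize(v) for v in face_emb]
    n = len(t)
    # logits[i][j]: temperature-scaled cosine similarity of text i and face j
    logits = [
        [sum(a * b for a, b in zip(t[i], f[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text -> face direction: the matching face sits on the diagonal.
    t2f = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Face -> text direction: same computation on the transposed matrix.
    cols = [list(c) for c in zip(*logits)]
    f2t = -sum(log_softmax_row(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2f + f2t)


# Toy 2-D embeddings: matched pairs should score a lower loss than swapped ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each text embedding toward its own face embedding and pushes it away from the other faces in the batch, which is the sense in which the alignment is global: it compares whole-sentence and whole-image embeddings rather than individual attributes.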