Synthesizing high-quality 3D face models from natural language descriptions is valuable for many applications, including avatar creation, virtual reality, and telepresence. However, little research has addressed this task. We argue that the major obstacles are 1) the lack of high-quality 3D face data with descriptive text annotations, and 2) the complex mapping between the descriptive language space and the shape/appearance space. To solve these problems, we build the Describe3D dataset, the first large-scale dataset with fine-grained text descriptions for the text-to-3D face generation task. We then propose a two-stage framework that first generates a 3D face matching the concrete descriptions, then refines the 3D face model by optimizing the parameters in the 3D shape and texture space according to the abstract descriptions. Extensive experimental results show that our method produces a faithful 3D face that conforms to the input descriptions with higher accuracy and quality than previous methods. The code and the Describe3D dataset are released at https://github.com/zhuhao-nju/describe3d.