The ability to generate diverse 3D articulated head avatars is vital to many applications, including augmented reality, cinematography, and education. Recent work on text-guided 3D object generation has shown great promise in addressing these needs. These methods directly leverage pre-trained 2D text-to-image diffusion models to generate multi-view-consistent 3D radiance fields of generic objects. However, because they lack geometry and texture priors, these methods offer limited control over the generated 3D objects, which makes it difficult to operate within a specific domain such as human heads. In this work, we develop a new approach to text-guided 3D head avatar generation that addresses this limitation. Our framework operates directly on the geometry and texture of an articulable 3D morphable model (3DMM) of a head and introduces novel optimization procedures that update the geometry and texture while keeping the 2D and 3D facial features aligned. The result is a 3D head avatar that is consistent with the text description and can be readily articulated using the deformation model of the 3DMM. We show that our diffusion-based articulated head avatars outperform state-of-the-art approaches for this task, which are typically based on CLIP, a model known to provide limited generation diversity and accuracy for 3D object generation.