Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two respects: (1) They rely mostly on a pre-trained text-to-image diffusion model while lacking the necessary 3D awareness and head priors, which makes the generated avatars prone to inconsistency and geometric distortions. (2) They fall short in fine-grained editing, primarily because of limitations inherited from the pre-trained 2D image diffusion models, which become more pronounced for 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back-view appearance of heads, enabling 3D-consistent head avatar generation. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This preserves identity while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.