Text-guided 3D generative methods have recently made remarkable progress in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two respects: (1) They rely mostly on a pre-trained text-to-image diffusion model that lacks the necessary 3D awareness and head priors, which makes the generated avatars prone to inconsistency and geometric distortions. (2) They fall short in fine-grained editing, primarily because of limitations inherited from the pre-trained 2D image diffusion models, which become more pronounced for 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back-view appearance of heads, enabling 3D-consistent head avatar generation. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique, which preserves identity while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.