Text-Guided Generation and Editing of Compositional 3D Avatars

Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods lack realism, produce unrealistic shapes, or do not support editing, such as modifying the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRFs). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.

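The compositional structure described in the abstract can be illustrated with a small data-structure sketch. This is not the TECA implementation: the class names (`MeshPart`, `NeRFPart`, `Avatar`) and the `transfer_part` helper are hypothetical, and the actual geometry, radiance fields, and rendering are replaced by plain text labels. The sketch only shows why keeping the mesh-based body separate from the NeRF-based components makes a hairstyle or scarf transferable between avatars.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class MeshPart:
    """Hypothetical stand-in for the mesh-based head, face, and upper body."""
    prompt: str


@dataclass(frozen=True)
class NeRFPart:
    """Hypothetical stand-in for a NeRF-based component (hair, scarf, ...)."""
    prompt: str


@dataclass
class Avatar:
    """An avatar as one mesh body plus named, independent NeRF components."""
    body: MeshPart
    parts: Dict[str, NeRFPart] = field(default_factory=dict)


def transfer_part(src: Avatar, dst: Avatar, name: str) -> Avatar:
    """Return a copy of `dst` whose component `name` is taken from `src`.

    Neither input avatar is modified; the body of `dst` is kept as-is.
    """
    if name not in src.parts:
        raise KeyError(f"source avatar has no component named {name!r}")
    new_parts = dict(dst.parts)
    new_parts[name] = src.parts[name]
    return Avatar(body=dst.body, parts=new_parts)


# Illustrative prompts only; they are not taken from the paper.
alice = Avatar(MeshPart("a young woman"), {"hair": NeRFPart("curly hair")})
bob = Avatar(MeshPart("an old man"), {"scarf": NeRFPart("fluffy scarf")})

bob_with_curls = transfer_part(alice, bob, "hair")
```

In this toy model, a monolithic representation would correspond to a single object holding face and hair together, so swapping only the hair would not be expressible as a dictionary update.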