AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars

Capturing and editing full head performances enables the creation of virtual characters for applications such as extended reality and media production. The past few years have witnessed a steep rise in the photorealism of human head avatars. Such avatars can be controlled through different input data modalities, including RGB, audio, depth, IMUs, and others. While these modalities provide effective means of control, they mostly focus on editing head movements such as facial expressions, head pose, and/or camera viewpoint. In this paper, we propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar. Our approach builds on existing work that captures dynamic performances of human heads using a neural radiance field (NeRF), and edits this representation with a text-to-image diffusion model. Specifically, we introduce an optimization strategy for incorporating multiple keyframes, representing different camera viewpoints and time stamps of a video performance, into a single diffusion model. Using this personalized diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware Score Distillation Sampling (VT-SDS) following a model-based guidance approach. Our method edits the full head in a canonical space and then propagates these edits to the remaining time steps via a pretrained deformation network. We evaluate our method visually and numerically via a user study, and the results show that our method outperforms existing approaches. Our experiments validate the design choices of our method and highlight that our edits are genuine, personalized, as well as 3D- and time-consistent.
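
For orientation, the standard Score Distillation Sampling gradient that VT-SDS builds on can be written as below. This is a sketch of the generic formulation, not the exact objective of this work. Here \theta denotes the NeRF parameters, x a rendered image, y the text prompt, \hat{\epsilon}_\phi the noise prediction of the diffusion model, and w(t) a timestep weighting. The camera view v and video time stamp \tau in the second line are hypothetical symbols, introduced only to illustrate how a view-and-time-aware variant might be conditioned; the abstract does not specify this form.

```latex
% Standard SDS gradient (generic form), with x_t = \alpha_t x + \sigma_t \epsilon:
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]

% Illustrative assumption only: rendering and guidance both depend on
% a camera view v and a time stamp \tau (hypothetical notation).
x = g(\theta;\, v,\, \tau), \qquad
\hat{\epsilon}_\phi = \hat{\epsilon}_\phi(x_t;\, y,\, v,\, \tau,\, t)
```

In this reading, the personalized diffusion model supplies guidance that is consistent with the keyframe's viewpoint and time stamp, while the gradient flows only through the rendered image into the NeRF parameters.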
