AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars

Capturing and editing full head performances enables the creation of virtual characters with various applications such as extended reality and media production. The past few years have witnessed a steep rise in the photorealism of human head avatars. Such avatars can be controlled through different input data modalities, including RGB, audio, depth, IMUs and others. While these data modalities provide effective means of control, they mostly focus on editing head movements, such as facial expressions, head pose and/or camera viewpoint. In this paper, we propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar. Our approach builds on existing work to capture dynamic performances of human heads using a neural radiance field (NeRF) and edits this representation with a text-to-image diffusion model. Specifically, we introduce an optimization strategy for incorporating multiple keyframes, representing different camera viewpoints and time stamps of a video performance, into a single diffusion model. Using this personalized diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware Score Distillation Sampling (VT-SDS) following a model-based guidance approach. Our method edits the full head in a canonical space and then propagates these edits to the remaining time steps via a pretrained deformation network. We evaluate our method visually and numerically via a user study, and the results show that our method outperforms existing approaches. Our experiments validate the design choices of our method and highlight that our edits are genuine, personalized, as well as 3D- and time-consistent.
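
For orientation, the standard Score Distillation Sampling gradient, as commonly used for text-driven optimization of 3D representations, can be written as below. This is the generic SDS form, not the VT-SDS objective itself: the abstract does not state the view-and-time-aware formulation. All symbols are assumptions introduced for illustration: \theta denotes the NeRF parameters, x = g(\theta) a rendered image, y the text prompt, \hat{\epsilon}_\phi the diffusion model's noise prediction, t the diffusion timestep (not the video time stamp), and w(t) a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad
x = g(\theta),
\quad
\epsilon \sim \mathcal{N}(0, I)
```

One plausible reading of the abstract is that VT-SDS additionally makes the noise prediction aware of camera viewpoint and video time stamp through the personalized diffusion model; that is an interpretation, not a formula given in the text.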
