We propose a method for synthesizing edited, photo-realistic digital avatars from text instructions. Given a short monocular RGB video and text instructions, our method uses an image-conditioned diffusion model to edit one head image and a video stylization method to carry that edit over to the remaining head images. Through iterative training and updating (three or more rounds), our method synthesizes edited, photo-realistic, animatable 3D neural head avatars using a head synthesis method based on deformable neural radiance fields. In quantitative and qualitative studies on various subjects, our method outperforms state-of-the-art methods.
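The control flow described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `edit_avatar` and the four callables it takes (`edit_keyframe`, `propagate_edit`, `train_avatar`, `render_frames`) are hypothetical placeholders for the diffusion editor, the video stylization step, the deformable-NeRF trainer, and the avatar renderer. The abstract does not say what is updated between rounds; the sketch assumes that frames rendered from the current avatar become the input to the next round of editing.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")    # one head image (placeholder type)
Avatar = TypeVar("Avatar")  # a trained head avatar (placeholder type)


def edit_avatar(
    frames: Sequence[Frame],
    instruction: str,
    edit_keyframe: Callable[[Frame, str], Frame],
    propagate_edit: Callable[[List[Frame], Frame], List[Frame]],
    train_avatar: Callable[[List[Frame]], Avatar],
    render_frames: Callable[[Avatar, int], List[Frame]],
    num_rounds: int = 3,
) -> Avatar:
    """Hypothetical edit -> propagate -> train -> update loop.

    All four callables are assumptions standing in for components that
    the abstract names only at a high level.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")

    current = list(frames)
    avatar: Optional[Avatar] = None
    for round_index in range(num_rounds):
        # Edit a single head image with the text instruction.
        edited_key = edit_keyframe(current[0], instruction)
        # Carry that edit over to every other head image.
        edited = propagate_edit(current, edited_key)
        # Train (or retrain) the avatar on the edited frames.
        avatar = train_avatar(edited)
        # Assumed update step: re-render frames for the next round.
        if round_index < num_rounds - 1:
            current = render_frames(avatar, len(current))
    return avatar
```

Passing the components in as callables keeps the sketch independent of any particular diffusion model, stylization method, or radiance-field implementation; `num_rounds=3` mirrors the "three or more" rounds mentioned above.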