Portrait synthesis creates realistic digital avatars that enable users to interact with others in a compelling way. Recent advances in StyleGAN and its extensions have shown promising results in photorealistic synthesis and accurate reconstruction of human faces. However, previous methods often focus on frontal faces, and most cannot handle large head rotations because of the distribution of StyleGAN's training data. In this work, we take a monocular video of a face as input and create an editable dynamic portrait that can handle extreme head poses. The user can render novel viewpoints, edit the appearance, and animate the face. Our method uses pivotal tuning inversion (PTI) to learn a personalized video prior from a monocular video sequence. We can then input pose and expression coefficients to MLPs and manipulate the latent vectors to synthesize different viewpoints and expressions of the subject. We also propose novel loss functions to further disentangle pose and expression in the latent space. Our algorithm substantially outperforms previous approaches on monocular video datasets and runs in real time at 54 FPS on an RTX 3080.
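The latent manipulation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the MLP outputs are combined with the latent vector, so the additive offsets used here are an assumption, as are the dimensions, the two-layer ReLU architecture, the random weights, and every name (`make_mlp`, `edit_latent`, `w_pivot`). The StyleGAN generator, PTI fine-tuning, and the proposed disentanglement losses are not modeled.

```python
import random

LATENT_DIM = 8   # hypothetical; real StyleGAN latents are much larger
POSE_DIM = 3     # assumed: yaw, pitch, roll
EXPR_DIM = 4     # assumed: a handful of expression coefficients


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Return randomly initialised weights for a two-layer MLP (zero biases)."""
    def layer(n_in, n_out):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases
    return [layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)]


def linear(weights, biases, x):
    """Apply one fully connected layer to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def mlp_forward(mlp, x):
    """Run the two-layer MLP with a ReLU between the layers."""
    (w1, b1), (w2, b2) = mlp
    hidden = [max(0.0, h) for h in linear(w1, b1, x)]
    return linear(w2, b2, hidden)


def edit_latent(w_pivot, pose, expr, pose_mlp, expr_mlp):
    """Shift the pivot latent by pose and expression offsets (assumed additive)."""
    d_pose = mlp_forward(pose_mlp, pose)
    d_expr = mlp_forward(expr_mlp, expr)
    return [w + dp + de for w, dp, de in zip(w_pivot, d_pose, d_expr)]


rng = random.Random(0)
pose_mlp = make_mlp(POSE_DIM, 16, LATENT_DIM, rng)
expr_mlp = make_mlp(EXPR_DIM, 16, LATENT_DIM, rng)
w_pivot = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # stand-in for an inverted latent

# Edited latent for a hypothetical head pose and expression.
w_edit = edit_latent(w_pivot, [0.5, 0.0, -0.2], [0.1, 0.0, 0.3, 0.0], pose_mlp, expr_mlp)
```

In a full pipeline, `w_edit` would be passed to the personalized generator to render the frame. Keeping separate MLPs for pose and expression is one simple way to let the two factors be controlled independently, which is the property the disentanglement losses are meant to encourage.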