StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video

Face reenactment methods attempt to restore and re-animate portrait videos as realistically as possible. Existing methods face a trade-off between quality and controllability: 2D GAN-based methods achieve higher image quality but offer weaker fine-grained control of facial attributes than their 3D counterparts. In this work, we propose StyleAvatar, a real-time photo-realistic portrait avatar reconstruction method using StyleGAN-based networks, which can generate high-fidelity portrait avatars with faithful expression control. We expand the capabilities of StyleGAN by introducing a compositional representation and a sliding window augmentation method, which enable faster convergence and improve translation generalization. Specifically, we divide the portrait scene into three parts for adaptive adjustment: the facial region, the non-facial foreground region, and the background. In addition, our network combines the strengths of UNet, StyleGAN, and time coding for video learning, which enables high-quality video generation. A sliding window augmentation method and a pre-training strategy are proposed to improve translation generalization and training performance, respectively. The proposed network converges within two hours while maintaining high image quality and a forward rendering time of only 20 milliseconds. We also present a real-time live system, which moves this research closer to practical applications. Results and experiments demonstrate the superiority of our method over existing facial reenactment methods in terms of image quality, full portrait video generation, and real-time re-animation. Training and inference code for this paper is available at https://github.com/LizhenWangT/StyleAvatar.
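
The abstract does not spell out how the sliding window augmentation works. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is that each training step crops the same randomly positioned window from a paired conditioning frame and ground-truth frame, so the network sees the portrait at many offsets and becomes less sensitive to translation. The function name `sliding_window_crop`, its parameters, and the toy 8x8 frames are all hypothetical; real frames would be image tensors rather than nested lists.

```python
import random


def sliding_window_crop(condition, target, window, rng):
    """Crop one randomly placed square window from a paired frame.

    Assumption (not from the paper): frames are row-major nested lists of
    pixels, and the same offset is applied to both frames so the pair
    stays spatially aligned.
    """
    height, width = len(target), len(target[0])
    if (len(condition), len(condition[0])) != (height, width):
        raise ValueError("condition and target must have the same size")
    if not 0 < window <= min(height, width):
        raise ValueError("window must fit inside the frame")

    # random.Random.randint includes both endpoints.
    top = rng.randint(0, height - window)
    left = rng.randint(0, width - window)

    def crop(frame):
        return [row[left:left + window] for row in frame[top:top + window]]

    return crop(condition), crop(target), (top, left)


# Toy usage: each "pixel" stores its own (row, column) coordinate.
rng = random.Random(0)
target = [[(r, c) for c in range(8)] for r in range(8)]
condition = [[(r, c) for c in range(8)] for r in range(8)]
cond_crop, tgt_crop, (top, left) = sliding_window_crop(condition, target, 6, rng)
```

Because both crops share one offset, the top-left pixel of each crop corresponds to position `(top, left)` in the full frame, which keeps the conditioning input and the supervision target aligned under every shift.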
