Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent work has focused on photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method for generating talking head animation with learnable style references. Given a set of style reference frames, our framework reconstructs a 2D talking head animation from a single input image and an audio stream. Our method first predicts facial landmark motion from the audio stream and constructs intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to produce photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experiments show that our method outperforms recent state-of-the-art approaches both qualitatively and quantitatively.
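The data flow described above (audio to landmark motion, style references to a style pattern, both into a style-aware generator) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the 68-point landmark layout, the tensor shapes, and the toy arithmetic inside each stage are hypothetical stand-ins. The actual modules are learned networks whose architectures the abstract does not specify.

```python
import numpy as np

NUM_LANDMARKS = 68  # assumption: a common 68-point face layout, not stated in the abstract


def audio_to_landmark_motion(audio, num_frames, rng):
    """Toy stand-in for stage 1: audio stream -> per-frame landmark displacements."""
    # Split the waveform into one window per output frame and use the
    # window's RMS energy as a crude driver of motion magnitude.
    windows = np.array_split(audio, num_frames)
    energy = np.array([np.sqrt(np.mean(w ** 2)) for w in windows])  # (T,)
    direction = rng.standard_normal((NUM_LANDMARKS, 2))             # (68, 2)
    return energy[:, None, None] * direction[None]                  # (T, 68, 2)


def build_style_pattern(style_frames):
    """Toy stand-in for stage 2: style reference frames -> one style vector."""
    # Per-channel mean and standard deviation over all reference frames.
    mean = style_frames.mean(axis=(0, 1, 2))
    std = style_frames.std(axis=(0, 1, 2))
    return np.concatenate([mean, std])                              # (2 * C,)


def style_aware_generator(image, motion, style):
    """Toy stand-in for stage 3: (image, motion, style) -> video frames."""
    channels = image.shape[-1]
    style_mean = style[:channels]
    frames = []
    for frame_motion in motion:
        # Motion modulates brightness; style shifts the per-channel mean.
        gain = 1.0 + 0.1 * np.tanh(np.abs(frame_motion).mean())
        shift = 0.1 * (style_mean - image.mean(axis=(0, 1)))
        frames.append(np.clip(image * gain + shift, 0.0, 1.0))
    return np.stack(frames)                                         # (T, H, W, C)


def animate(image, audio, style_frames, num_frames, seed=0):
    """Chain the three stages in the order the abstract describes."""
    rng = np.random.default_rng(seed)
    motion = audio_to_landmark_motion(audio, num_frames, rng)
    style = build_style_pattern(style_frames)
    return style_aware_generator(image, motion, style)


# Usage with synthetic inputs (all values are placeholders).
demo_rng = np.random.default_rng(1)
demo_image = np.full((8, 8, 3), 0.5)
demo_audio = np.sin(np.linspace(0.0, 100.0, 16000))
demo_style = demo_rng.random((4, 8, 8, 3))
frames = animate(demo_image, demo_audio, demo_style, num_frames=5)
```

The sketch only fixes the interfaces between stages: the generator receives one motion tensor per output frame plus a single style vector shared across the clip, which is one plausible reading of "feed both outputs into a style-aware image generator".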