Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework can reconstruct 2D talking head animation from a single input image and an audio stream. Our method first produces facial landmark motion from the audio stream and constructs intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to produce photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experiments show that our method outperforms recent state-of-the-art approaches both qualitatively and quantitatively.
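The three-stage data flow described in the abstract (audio to landmark motion, style references to style patterns, then a style-aware generator) can be illustrated with a minimal sketch. This is not the paper's implementation: every function name, parameter, and computation below is a hypothetical placeholder, with simple arithmetic standing in for the learned networks, and images are reduced to flat lists of floats purely to show how the inputs and intermediate outputs connect.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class Frame:
    """One output frame: toy 'pixels' plus the landmarks that drove it."""
    pixels: Vector
    landmarks: Vector


def audio_to_landmark_motion(audio: Sequence[float], hop: int,
                             n_landmarks: int) -> List[Vector]:
    """Stage 1 (placeholder): one landmark-displacement vector per audio window.

    A learned audio encoder would go here; this stand-in broadcasts each
    window's mean absolute amplitude to every landmark coordinate.
    """
    motion = []
    for start in range(0, len(audio) - hop + 1, hop):
        window = audio[start:start + hop]
        energy = sum(abs(s) for s in window) / hop
        motion.append([energy] * n_landmarks)
    return motion


def extract_style_pattern(style_frames: Sequence[Vector]) -> Vector:
    """Stage 2 (placeholder): average the reference frames into one style vector."""
    n = len(style_frames)
    dim = len(style_frames[0])
    return [sum(f[i] for f in style_frames) / n for i in range(dim)]


def style_aware_generate(image: Vector, motion: List[Vector],
                         style: Vector) -> List[Frame]:
    """Stage 3 (placeholder): combine source image, per-frame motion, and style."""
    # Assumption: style acts as a scalar gain on the landmark motion.
    style_gain = 1.0 + sum(style) / len(style)
    frames = []
    for landmarks in motion:
        scaled = [style_gain * d for d in landmarks]
        shift = sum(scaled) / len(scaled)
        frames.append(Frame(pixels=[p + shift for p in image], landmarks=scaled))
    return frames


def animate(image: Vector, audio: Sequence[float],
            style_frames: Sequence[Vector],
            hop: int = 4, n_landmarks: int = 3) -> List[Frame]:
    """Run the three stages in the order the abstract describes."""
    motion = audio_to_landmark_motion(audio, hop, n_landmarks)
    style = extract_style_pattern(style_frames)
    return style_aware_generate(image, motion, style)


if __name__ == "__main__":
    frames = animate(image=[0.0, 0.5],
                     audio=[0.0] * 4 + [1.0] * 4,
                     style_frames=[[0.0, 1.0], [1.0, 1.0]])
    for i, frame in enumerate(frames):
        print(i, frame.pixels, frame.landmarks)
```

The point of the sketch is the separation of concerns: the style pattern is computed once from the reference frames and is independent of the source image, which is what would allow a character's style to be reused with any new static image.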