We propose StyleTalker, a novel audio-driven talking head generation model that synthesizes a video of a talking person from a single reference image, with lip shapes accurately synchronized to the audio, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of a talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, so that motions and lip movements can be manipulated independently while preserving identity; and 3) an auto-regressive prior augmented with normalizing flow that learns a complex, multi-modal audio-to-motion latent space. Equipped with these components, StyleTalker can generate talking head videos in two ways: in a motion-controllable way, when a separate motion source video is given, and in a completely audio-driven manner, by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model synthesizes talking head videos with impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
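To make the idea of a contrastive lip-sync objective concrete, the following is a minimal sketch only, not the StyleTalker implementation. The abstract does not specify the loss, so an InfoNCE-style formulation is assumed here: each audio embedding should score highest against the lip embedding of its own frame window, and lower against the other windows in the batch. The function name `contrastive_sync_loss`, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (assumes the vector is nonzero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_sync_loss(audio_embs, lip_embs, temperature=0.1):
    """InfoNCE-style loss over paired audio and lip embeddings.

    audio_embs[i] and lip_embs[i] are assumed to come from the same
    time window (the positive pair); every other lip embedding in the
    batch serves as a negative for audio_embs[i].
    """
    audio = [_normalize(a) for a in audio_embs]
    lips = [_normalize(v) for v in lip_embs]
    total = 0.0
    for i, a in enumerate(audio):
        # Cosine similarity of this audio clip to every lip clip.
        logits = [sum(x * y for x, y in zip(a, v)) / temperature for v in lips]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching lip clip.
        total += log_denom - logits[i]
    return total / len(audio)


# Toy example: two audio windows and their matching lip windows.
audio = [[1.0, 0.0], [0.0, 1.0]]
synced = [[0.9, 0.1], [0.1, 0.9]]
swapped = [synced[1], synced[0]]  # deliberately misaligned pairs

loss_synced = contrastive_sync_loss(audio, synced)
loss_swapped = contrastive_sync_loss(audio, swapped)
```

Under this assumed formulation, correctly paired embeddings yield a lower loss than misaligned ones, which is the signal a lip-sync discriminator of this kind would provide to the generator.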