Current talking face generation methods mainly focus on speech-lip synchronization. However, insufficient investigation of facial talking style leads to lifeless and monotonous avatars. Most previous works fail to imitate expressive styles from arbitrary video prompts while ensuring the authenticity of the generated video. This paper proposes an unsupervised variational style transfer model (VAST) to vivify neutral photo-realistic avatars. Our model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder that models accurate speech-related movements; and a variational style enhancer that makes the style space highly expressive and meaningful. With these designs for facial style learning, our model can flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner. Experimental results demonstrate that the proposed approach yields a more vivid talking avatar with higher authenticity and richer expressiveness.