One-shot talking head generation produces lip-synced talking heads from arbitrary audio and a single source face. To improve naturalness and realism, recent methods aim for free pose control instead of merely editing the mouth region. However, existing methods fail to preserve the identity of the source face accurately when generating head motion. To solve this identity mismatch problem and achieve high-quality free pose control, we present the One-shot Pose-controllable Talking head generation network (OPT). Specifically, the Audio Feature Disentanglement Module separates content features from the audio, eliminating the influence of speaker-specific information contained in arbitrary driving audio. Next, the mouth expression feature is extracted from the content feature and the source face; during this step, a landmark loss is designed to improve the accuracy of the facial structure and the quality of identity preservation. Finally, to achieve free pose control, controllable head pose features from reference videos are fed into the Video Generator along with the expression feature and the source face to generate new talking heads. Extensive quantitative and qualitative experiments verify that OPT generates high-quality pose-controllable talking heads without the identity mismatch problem, outperforming previous state-of-the-art (SOTA) methods.
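The abstract does not give the exact form of the landmark loss. As a minimal sketch, assuming it penalizes the mean Euclidean distance between predicted and ground-truth 2D facial landmarks (one common choice, not confirmed by the source), it could look like the following. The function name `landmark_loss` and the point representation are hypothetical and are not taken from OPT.

```python
import math
from typing import Sequence, Tuple

# A 2D landmark as (x, y); this representation is an assumption for illustration.
Point = Tuple[float, float]


def landmark_loss(predicted: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding 2D landmarks.

    Hypothetical stand-in for the landmark loss named in the abstract;
    the actual OPT formulation may differ.
    """
    if len(predicted) != len(target):
        raise ValueError("landmark sets must have the same length")
    if not predicted:
        raise ValueError("at least one landmark is required")
    # Sum the per-landmark distances, then average over the landmark count.
    total = sum(math.dist(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (3.0, 4.0)]
    gt = [(0.0, 0.0), (0.0, 0.0)]
    print(landmark_loss(pred, gt))  # distances are 0.0 and 5.0, so the mean is 2.5
```

Under this assumption, a lower value means the generated face's landmarks sit closer to the reference geometry, which is the sense in which such a loss would encourage accurate facial structure and identity preservation.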