One-shot talking-head synthesis aims to animate a source image to a different pose and expression, as dictated by a driving frame. Recent methods warp the appearance features extracted from the source using motion fields estimated from sparse keypoints, which are learned in an unsupervised manner. Because of their lightweight formulation, these methods are suitable for video conferencing at reduced bandwidth. However, our study shows that current methods suffer from two major limitations: 1) unsatisfactory generation quality under large head poses and when there is observable pose misalignment between the source and the first frame of the driving video; 2) failure to capture fine yet critical facial motion details, owing to a lack of semantic understanding and of appropriate face geometry regularization. To address these shortcomings, we propose a novel method that leverages rich face prior information. Under equivalent bandwidth, the proposed model generates face videos with improved semantic consistency (improving on the baseline by $7\%$ in average keypoint distance) and better expression preservation (outperforming the baseline by $15\%$ in average emotion embedding distance). Additionally, incorporating such prior information provides a convenient interface for highly controllable generation in terms of both pose and expression.
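As a rough illustration of the keypoint-based metric mentioned above, the following sketch computes an average keypoint distance between generated and reference frames, together with a relative improvement over a baseline. It is only a minimal sketch under stated assumptions: the function names `average_keypoint_distance` and `relative_improvement` are hypothetical, keypoints are assumed to be already detected and matched one-to-one as 2D pixel coordinates, and the percentage is assumed to be a relative reduction of a lower-is-better score. The abstract does not specify the exact protocol, so none of this should be read as the authors' evaluation code.

```python
import math


def average_keypoint_distance(pred_frames, true_frames):
    """Mean Euclidean distance between matched keypoints.

    Assumption: each argument is a sequence of frames, and each frame is a
    sequence of (x, y) keypoints listed in the same order in both inputs.
    """
    if len(pred_frames) != len(true_frames):
        raise ValueError("both inputs must contain the same number of frames")
    total, count = 0.0, 0
    for pred, true in zip(pred_frames, true_frames):
        if len(pred) != len(true):
            raise ValueError("each frame pair must have the same number of keypoints")
        for p, t in zip(pred, true):
            total += math.dist(p, t)  # Euclidean distance between two points
            count += 1
    if count == 0:
        raise ValueError("no keypoints provided")
    return total / count


def relative_improvement(baseline, ours):
    """Fractional reduction of a lower-is-better metric relative to a baseline.

    Assumption: this is one plausible way to express an improvement as a
    percentage; the source does not state how its percentages are derived.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - ours) / baseline
```

For example, `average_keypoint_distance([[(3, 4), (0, 0)]], [[(0, 0), (0, 0)]])` averages the per-keypoint distances 5 and 0 over a single frame, and `relative_improvement(2.0, 1.0)` expresses a halving of the score as a fraction of the baseline.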