The goal of this paper is to synthesise talking faces with controllable facial motions. To this end, we propose two key ideas. The first is to establish a canonical space in which every face shares the same motion pattern but retains its own identity. The second is to navigate a multimodal motion space that represents only motion-related features and eliminates identity information. To disentangle identity and motion, we introduce an orthogonality constraint between the two latent spaces. As a result, our method generates natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation. Extensive experiments demonstrate that our method achieves state-of-the-art results in terms of both visual quality and lip-sync score. To the best of our knowledge, we are the first to develop a talking face generation framework that accurately manifests full target facial motions, including lip, head pose, and eye movements, in the generated video without any supervision beyond RGB video with audio.
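The abstract does not specify the form of the orthogonality constraint between the identity and motion latent spaces. As an illustration only, the following minimal sketch assumes one common formulation: penalising the squared cosine similarity between each sample's identity code and its motion code. The function name `orthogonality_penalty`, the pairing of codes per sample, and the assumption that both codes have the same dimension are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(identity_codes, motion_codes, eps=1e-8):
    """Mean squared cosine similarity between paired latent codes.

    Hypothetical illustration, not the paper's loss: the value is 0 when
    every identity code is orthogonal to its paired motion code and
    approaches 1 when the paired codes are parallel.
    """
    if len(identity_codes) != len(motion_codes):
        raise ValueError("batches must contain the same number of codes")
    if not identity_codes:
        raise ValueError("batches must not be empty")

    total = 0.0
    for z_id, z_m in zip(identity_codes, motion_codes):
        if len(z_id) != len(z_m):
            raise ValueError("paired codes must have the same dimension")
        norm = math.sqrt(dot(z_id, z_id)) * math.sqrt(dot(z_m, z_m))
        cos = dot(z_id, z_m) / (norm + eps)  # eps guards against zero vectors
        total += cos * cos
    return total / len(identity_codes)


if __name__ == "__main__":
    # Orthogonal pair: no penalty.
    print(orthogonality_penalty([[1.0, 0.0]], [[0.0, 2.0]]))
    # Parallel pair: penalty close to its maximum.
    print(orthogonality_penalty([[1.0, 0.0]], [[3.0, 0.0]]))
```

In a training setup, a term of this kind would typically be added to the other losses with a weighting factor, so that minimising it pushes the two codes towards carrying non-overlapping information.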