Generating talking face videos from audio has attracted considerable research interest. A few person-specific methods can generate vivid videos, but they require videos of the target speaker for training or fine-tuning. Existing person-generic methods struggle to generate realistic, lip-synced videos while preserving identity information. To tackle this problem, we propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering. First, we devise a novel Transformer-based landmark generator that infers lip and jaw landmarks from the audio. Prior landmark characteristics of the speaker's face are employed so that the generated landmarks conform to the speaker's facial outline. Then, a video rendering model translates the generated landmarks into face images. During this stage, prior appearance information is extracted from the target face with its lower half occluded and from static reference images, which helps generate realistic and identity-preserving visual content. To exploit the prior information in the static reference images effectively, we align them with the target face's pose and expression based on motion fields. Moreover, auditory features are reused to ensure that the generated face images are well synchronized with the audio. Extensive experiments demonstrate that our method produces more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
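The data flow of the two-stage framework can be illustrated with a minimal sketch. This is not the proposed model: the Transformer landmark generator and the learned renderer are replaced by trivial NumPy stand-ins, and every function name, argument, and tensor shape below is a hypothetical choice made only to show how audio features, prior landmarks, motion-field-aligned references, and the lower-half occluded target frame fit together. For simplicity, one motion field per reference image is assumed rather than one per target frame.

```python
import numpy as np


def generate_landmarks(audio_feats, ref_mouth_landmarks):
    """Stage 1 stand-in: audio features (T, D) -> lip/jaw landmarks (T, K, 2).

    Placeholder for the Transformer-based generator: the speaker's prior mouth
    shape is shifted vertically in proportion to a per-frame audio energy.
    """
    prior = ref_mouth_landmarks.mean(axis=0)           # (K, 2) speaker-specific prior
    energy = np.abs(audio_feats).mean(axis=1)          # (T,) crude audio energy
    offsets = np.zeros((audio_feats.shape[0],) + prior.shape)
    offsets[:, :, 1] = energy[:, None]                 # displace the y coordinate only
    return prior[None] + offsets


def warp_with_motion_field(image, flow):
    """Nearest-neighbour backward warp of an (H, W, C) image.

    flow has shape (H, W, 2) holding (dy, dx): output[y, x] = image[y + dy, x + dx],
    with source coordinates clipped to the image border.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return image[src_y, src_x]


def render_frame(target_frame, aligned_refs, landmarks_t, audio_t):
    """Stage 2 stand-in: fill the occluded lower half of the target frame.

    landmarks_t and audio_t are accepted to mirror the conditioning described
    in the text, but this placeholder ignores them and simply copies the mean
    of the aligned reference images into the lower half.
    """
    h = target_frame.shape[0]
    out = target_frame.copy()
    out[h // 2:] = np.mean(aligned_refs, axis=0)[h // 2:]
    return out


def talking_face_pipeline(audio_feats, target_frames, ref_images,
                          ref_mouth_landmarks, flows):
    """Run both stages: audio -> landmarks, then landmarks -> frames."""
    landmarks = generate_landmarks(audio_feats, ref_mouth_landmarks)
    aligned = np.stack([warp_with_motion_field(r, f)
                        for r, f in zip(ref_images, flows)])
    frames = np.stack([render_frame(t, aligned, lm, a)
                       for t, lm, a in zip(target_frames, landmarks, audio_feats)])
    return landmarks, frames
```

In this sketch the upper half of each output frame comes from the target frame unchanged, while the lower half is taken from the aligned references, which mirrors the role the two appearance priors play in the rendering stage.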