We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters from RGB video data captured with commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a token sequence of the speaker's face landmark motion and body-joint motion computed from a video, our method synthesizes the motion sequences for the speaker's face landmarks and body joints to match the content and the affect of the speech. We design a generator consisting of a set of encoders to transform all the inputs into a multimodal embedding space capturing their correlations, followed by a pair of decoders to synthesize the desired face and pose motions. To enhance the plausibility of synthesis, we use an adversarial discriminator that learns to differentiate between the face and pose motions computed from the original videos and our synthesized motions based on their affective expressions. To evaluate our approach, we extend the TED Gesture Dataset to include view-normalized, co-speech face landmarks in addition to body gestures. We demonstrate the performance of our method through extensive quantitative and qualitative experiments on multiple evaluation metrics and via a user study. We observe that our method results in low reconstruction error and produces synthesized samples with diverse facial expressions and body gestures for digital characters.
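To make the described generator–discriminator layout concrete, the following is a minimal PyTorch-style sketch of a generator with per-modality encoders, a fused multimodal embedding, and a pair of decoders, together with an adversarial discriminator over face and pose motion. All layer types, dimensions, sequence-length assumptions, and the fusion scheme are illustrative assumptions for exposition and do not reproduce the paper's actual architecture.

```python
# Minimal sketch of the generator/discriminator structure described in the abstract.
# All module choices and sizes are assumptions; inputs are assumed to be per-frame
# feature sequences of equal length T for audio, face landmarks, and body joints.
import torch
import torch.nn as nn


class MultimodalGenerator(nn.Module):
    def __init__(self, audio_dim=128, face_dim=68 * 2, pose_dim=10 * 3, embed_dim=256):
        super().__init__()
        # One encoder per modality: speech audio features, face-landmark motion,
        # and body-joint motion.
        self.audio_enc = nn.GRU(audio_dim, embed_dim, batch_first=True)
        self.face_enc = nn.GRU(face_dim, embed_dim, batch_first=True)
        self.pose_enc = nn.GRU(pose_dim, embed_dim, batch_first=True)
        # Fuse the per-frame encodings into a shared multimodal embedding.
        self.fuse = nn.Linear(3 * embed_dim, embed_dim)
        # A pair of decoders: one for face-landmark motion, one for body-joint motion.
        self.face_dec = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.face_out = nn.Linear(embed_dim, face_dim)
        self.pose_dec = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.pose_out = nn.Linear(embed_dim, pose_dim)

    def forward(self, audio, face_seed, pose_seed):
        a, _ = self.audio_enc(audio)        # (B, T, embed_dim)
        f, _ = self.face_enc(face_seed)     # (B, T, embed_dim)
        p, _ = self.pose_enc(pose_seed)     # (B, T, embed_dim)
        # Shared multimodal embedding capturing cross-modal correlations.
        z = torch.tanh(self.fuse(torch.cat([a, f, p], dim=-1)))
        face_h, _ = self.face_dec(z)
        pose_h, _ = self.pose_dec(z)
        return self.face_out(face_h), self.pose_out(pose_h)


class AffectiveDiscriminator(nn.Module):
    """Scores whether a (face motion, pose motion) pair looks like real co-speech motion."""

    def __init__(self, face_dim=68 * 2, pose_dim=10 * 3, hidden=256):
        super().__init__()
        self.enc = nn.GRU(face_dim + pose_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, face_motion, pose_motion):
        h, _ = self.enc(torch.cat([face_motion, pose_motion], dim=-1))
        return torch.sigmoid(self.score(h[:, -1]))  # one real/fake score per sequence


if __name__ == "__main__":
    # Toy shapes only: batch of 2, sequence length 30.
    audio = torch.randn(2, 30, 128)
    face_seed = torch.randn(2, 30, 68 * 2)
    pose_seed = torch.randn(2, 30, 10 * 3)
    gen, disc = MultimodalGenerator(), AffectiveDiscriminator()
    face_motion, pose_motion = gen(audio, face_seed, pose_seed)
    print(face_motion.shape, pose_motion.shape, disc(face_motion, pose_motion).shape)
```

In an adversarial training loop of this kind, the discriminator would be trained to separate motions extracted from the original videos from the generator's outputs, while the generator is trained to both reconstruct the target motions and fool the discriminator; the specific losses and affect features used by the paper are not shown here.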