To be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the lip-sync quality of talking head avatars while still allowing for the generation of diverse, high-quality, expressive facial animations. The code and models will be available at https://thunder.is.tue.mpg.de/
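The analysis-by-audio-synthesis loop described above can be sketched as follows. This is a minimal illustrative sketch only, not the paper's implementation: it assumes (hypothetically) that per-frame mel-style audio features stand in for speech, and a fixed linear map stands in for the trained mesh-to-speech network; the names `mesh_to_speech` and `lip_sync_loss` are invented for illustration.

```python
# Toy sketch of a differentiable analysis-by-audio-synthesis supervision loss:
# regress audio features from generated facial motion, then compare them to the
# input speech features. All shapes and the linear "model" are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

N_VERTS, N_MEL, T = 15, 8, 20  # toy sizes: motion dims, audio-feature bins, frames

# Stand-in for the trained, frozen mesh-to-speech regressor (a network in practice).
W_mesh2speech = rng.standard_normal((N_MEL, N_VERTS)) * 0.1

def mesh_to_speech(motion):
    """Regress per-frame audio features from per-frame mesh motion."""
    return motion @ W_mesh2speech.T  # (T, N_MEL)

def lip_sync_loss(generated_motion, input_speech):
    """Compare re-synthesized speech with the input audio (MSE here)."""
    speech_hat = mesh_to_speech(generated_motion)
    return float(np.mean((speech_hat - input_speech) ** 2))

# Motion whose regressed audio matches the input incurs zero loss...
motion = rng.standard_normal((T, N_VERTS))
speech = mesh_to_speech(motion)
print(lip_sync_loss(motion, speech))  # 0.0
# ...while mismatched motion is penalized, yielding a lip-sync training signal.
print(lip_sync_loss(rng.standard_normal((T, N_VERTS)), speech))
```

Because every step is differentiable, gradients of this loss could flow back through the frozen mesh-to-speech model into the motion generator, which is the mechanism the abstract describes.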