This paper presents a novel approach for generating 3D talking heads from raw audio inputs. Our method builds on the idea that speech-related movements can be comprehensively and efficiently described by the motion of a few control points located on the movable parts of the face, i.e., landmarks. The underlying musculoskeletal structure then allows us to learn how the motion of these landmarks influences the geometric deformation of the whole face. To this end, the proposed method employs two distinct models: the first learns to generate the motion of a sparse set of landmarks from the given audio, and the second expands this landmark motion into a dense motion field, which is used to animate a given 3D mesh in its neutral state. Additionally, we introduce a novel loss function, named Cosine Loss, which minimizes the angle between the generated motion vectors and the ground-truth ones. Using landmarks in 3D talking head generation offers several advantages, such as consistency, reliability, and no need for manual annotation. Our approach is designed to be identity-agnostic, enabling high-quality facial animations for any user without additional data or training.
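To make the idea of an angle-based loss concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact formula, so the `1 - cos` form, the function name `cosine_loss`, the `eps` guard, and the list-of-tuples input format are all assumptions chosen for illustration.

```python
import math


def cosine_loss(pred, target, eps=1e-8):
    """Mean of (1 - cos(angle)) over paired motion vectors.

    Assumption: the abstract only says the loss minimizes the angle
    between generated and ground-truth motion vectors; the 1 - cos
    form used here is one common way to express that.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")

    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_t = math.sqrt(sum(b * b for b in t))
        # eps avoids division by zero for static points (zero motion);
        # such pairs get cos = 0 and therefore contribute 1 to the sum.
        cos = dot / max(norm_p * norm_t, eps)
        total += 1.0 - cos
    return total / len(pred)


# Hypothetical per-landmark displacement vectors (dx, dy, dz).
generated = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ground_truth = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
loss = cosine_loss(generated, ground_truth)
```

In this sketch the loss depends only on direction: the first pair points the same way and contributes 0 despite the different magnitudes, while the second pair is orthogonal and contributes 1. A loss of this kind would therefore typically be paired with a magnitude-aware term such as an L2 reconstruction loss, although the abstract does not say which terms are combined.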