This paper presents a novel approach for generating 3D talking heads from raw audio inputs. Our method builds on the idea that speech-related movements can be comprehensively and efficiently described by the motion of a few control points located on the movable parts of the face, i.e., landmarks. The underlying musculoskeletal structure then allows us to learn how the motion of these landmarks influences the geometrical deformation of the whole face. To this end, the proposed method employs two distinct models: the first learns to generate the motion of a sparse set of landmarks from the given audio, and the second expands this landmark motion into a dense motion field, which is used to animate a given 3D mesh in its neutral state. Additionally, we introduce a novel loss function, named Cosine Loss, which minimizes the angle between the generated motion vectors and the ground-truth ones. Using landmarks in 3D talking head generation offers several advantages, such as consistency, reliability, and obviating the need for manual annotation. Our approach is designed to be identity-agnostic, enabling high-quality facial animations for any user without additional data or training.
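To illustrate the idea behind the Cosine Loss, the following is a minimal sketch, not the authors' implementation. It assumes the loss takes the form of the mean of 1 − cos θ over per-point 3D motion vectors; the abstract states only that the loss minimizes the angle between generated and ground-truth vectors, so the exact formula, the function name `cosine_loss`, and the `eps` safeguard are assumptions made for illustration.

```python
import numpy as np


def cosine_loss(pred, target, eps=1e-8):
    """Mean of (1 - cos(theta)) between predicted and ground-truth vectors.

    Hypothetical formulation (assumed, not taken from the paper).
    pred, target: array-likes of shape (..., 3), one motion vector per point.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Dot product and product of norms along the last (xyz) axis.
    dot = np.sum(pred * target, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)

    # eps guards against division by zero; a zero-length vector
    # yields cos = 0 and therefore contributes a loss of 1.
    cos = dot / np.maximum(norms, eps)

    return float(np.mean(1.0 - cos))
```

Under this assumed form, the value is 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite vectors, regardless of their lengths. Because such a term constrains direction only, it says nothing about motion magnitude; how the loss is combined with other training terms is not stated in the abstract.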