Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head. Existing methods are typically trained on public 3D datasets that contain only a small number of audio-3D scan pairs, which limits their generalization capability. In this paper, we propose a novel method that leverages in-the-wild 2D talking-head videos to train a 3D facial animation model. The abundance of easily accessible 2D talking-head videos gives our model a robust generalization capability. By combining these videos with existing 3D face reconstruction methods, our model generates consistent, high-fidelity lip synchronization. It also captures the speaking styles of different individuals, allowing it to generate 3D talking heads with distinct personal styles. Extensive qualitative and quantitative experiments demonstrate the superiority of our method.