Speech-driven 3D facial animation, which aims to generate facial movements synchronized with the driving speech, has been widely explored in recent years. Most existing works neglect the person-specific talking style during generation, including facial expression and head pose styles. Several works attempt to capture this personal style by fine-tuning modules, but limited training data leaves the results lacking in vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach that learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter that builds a discrete pose prior and, without fine-tuning, retrieves the appropriate style embedding with a semantic-aware pose style matrix. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
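
The abstract names MoLoRA but does not spell out its form, so the following is only a minimal sketch of one plausible reading: a frozen linear layer whose weight is augmented by a sum of several low-rank updates of different ranks. The class name `MoLoRALinear`, the rank set, the `alpha / r` scaling, and the plain summation are all assumptions made for illustration and are not taken from the AdaMesh code.

```python
import numpy as np


class MoLoRALinear:
    """Frozen linear layer plus a sum of low-rank updates (illustrative only).

    Assumption: the "mixture" is a plain sum of LoRA branches with
    different ranks. The paper's actual formulation may differ.
    """

    def __init__(self, weight, ranks=(1, 2, 4), alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight, shape (d_out, d_in).
        self.weight = np.asarray(weight, dtype=float)
        d_out, d_in = self.weight.shape
        self.alpha = alpha
        self.ranks = tuple(ranks)
        # Trainable factors: A is small random, B starts at zero,
        # so the adapted layer initially equals the base layer.
        self.A = [rng.normal(0.0, 0.02, size=(r, d_in)) for r in self.ranks]
        self.B = [np.zeros((d_out, r)) for r in self.ranks]

    def delta(self):
        """Combined weight update from all low-rank branches."""
        d = np.zeros_like(self.weight)
        for r, A, B in zip(self.ranks, self.A, self.B):
            d += (self.alpha / r) * (B @ A)
        return d

    def __call__(self, x):
        # x has shape (batch, d_in); output has shape (batch, d_out).
        return x @ (self.weight + self.delta()).T
```

In this sketch only the `A` and `B` factors would be updated during adaptation, which keeps the number of trainable parameters small when the reference clip is short.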