Speech-driven 3D facial animation aims to synthesize vivid facial animations that are accurately synchronized with speech and match the speaker's unique speaking style. However, existing works focus primarily on precise lip synchronization and neglect to model subject-specific speaking style, which often results in unrealistic facial animations. To the best of our knowledge, this work is the first attempt to explore the coupling between speaking style and semantic content in facial motions. Specifically, we introduce a speaking style disentanglement method that enables speaking style encoding for arbitrary subjects and leads to more realistic synthesis of speech-driven facial animations. To this end, we propose a novel framework called \textbf{Mimic}, which learns disentangled representations of speaking style and content from facial motions by building two separate latent spaces, one for style and one for content. Moreover, to facilitate disentangled representation learning, we introduce four constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which together help construct an identity-related style space and a semantics-related content space. Extensive qualitative and quantitative experiments on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and captures diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/