Speech-driven 3D facial animation has been studied for years, but its practical application still falls short of expectations. The main challenges lie in limited data, lip alignment, and the naturalness of facial expressions. Although lip alignment has been studied extensively, existing methods struggle to synthesize natural and realistic expressions, so the resulting facial animations look mechanical and stiff. Even when emotional features are extracted from speech, the randomness of facial movements limits how effectively emotions are expressed. To address this issue, this paper proposes CSTalk (Correlation Supervised), a method that models the correlations among the movements of different facial regions and uses them to supervise the training of the generative model, so that the generated expressions conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the MetaHuman character model and capture a dataset covering five different emotions. We train a generative network with an autoencoder structure and feed it an emotion embedding vector to generate user-controllable expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.
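The abstract does not state how the correlations are modeled or how the supervision term is formulated. As a minimal sketch of the general idea only, assume that correlation is measured as the Pearson coefficient between control-parameter channels over time, and that the supervision penalizes the gap between the correlation matrix of a generated sequence and that of a captured reference. The function names `pearson`, `correlation_matrix`, and `correlation_loss` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def correlation_matrix(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """frames: T frames, each holding C control-parameter values.

    Returns the C x C matrix of pairwise channel correlations over time.
    """
    channels = list(zip(*frames))  # transpose to C series of length T
    return [[pearson(ci, cj) for cj in channels] for ci in channels]


def correlation_loss(predicted: Sequence[Sequence[float]],
                     reference: Sequence[Sequence[float]]) -> float:
    """Assumed supervision term: mean squared difference between the
    correlation matrices of the generated and the reference motion."""
    cp = correlation_matrix(predicted)
    cr = correlation_matrix(reference)
    c = len(cp)
    return sum((cp[i][j] - cr[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)


# Toy example with two control channels that move together in the reference.
reference = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
consistent = [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]    # same co-movement, smaller amplitude
inconsistent = [[0.0, 4.0], [1.0, 2.0], [2.0, 0.0]]  # channels move in opposite directions

loss_consistent = correlation_loss(consistent, reference)
loss_inconsistent = correlation_loss(inconsistent, reference)
```

In this toy setup the loss depends only on how channels co-vary, not on their amplitude, so a sequence that preserves the reference co-movement is not penalized while one that breaks it is. In a real training loop such a term would presumably be added to a reconstruction loss and computed with a differentiable tensor library; that integration is likewise an assumption here.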