The creation of increasingly vivid 3D virtual digital humans has become an active research topic in recent years. Most current speech-driven methods train models to learn the relationship between phonemes and visemes in order to produce more realistic lip movements. However, these methods fail to capture the correlations between emotions and facial expressions effectively. To address this problem, we propose a new model, termed EmoFace. EmoFace employs a novel Mesh Attention mechanism, which helps to learn potential feature dependencies between mesh vertices in both time and space. We also adopt, for the first time to our knowledge in a 3D face animation task, an effective self-growing training scheme that combines teacher forcing and scheduled sampling. Additionally, since EmoFace is an autoregressive model, the first frame of the training data is not required to be a silent frame, which greatly relaxes the constraints on the data and helps to alleviate the current shortage of datasets. Comprehensive quantitative and qualitative evaluations on our proposed high-quality reconstructed 3D emotional facial animation dataset, 3D-RAVDESS ($5.0343\times 10^{-5}$ mm for LVE and $1.0196\times 10^{-5}$ mm for EVE), and on the publicly available VOCASET dataset ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our algorithm achieves state-of-the-art performance.
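As an illustration of how teacher forcing and scheduled sampling can be combined in an autoregressive rollout, the following is a minimal sketch and not the EmoFace implementation. The linear decay schedule, the function names `teacher_forcing_prob` and `rollout`, and the `step_fn` model call are all assumptions introduced here for illustration; the abstract does not specify them.

```python
import random


def teacher_forcing_prob(epoch, total_epochs):
    """Probability of feeding the ground-truth frame at a given epoch.

    Assumed schedule: linear decay from 1.0 (pure teacher forcing)
    to 0.0 (pure free-running) over the course of training.
    """
    if total_epochs <= 1:
        # Degenerate case: a single epoch is treated as free-running.
        return 0.0
    return max(0.0, 1.0 - epoch / (total_epochs - 1))


def rollout(step_fn, targets, start, p_teacher, rng):
    """Predict len(targets) frames autoregressively.

    step_fn(prev_frame, t) is a hypothetical model call that returns the
    prediction for frame t given the previous frame. After each step, the
    next input is the ground truth with probability p_teacher, otherwise
    the model's own prediction (scheduled sampling).
    """
    preds = []
    prev = start  # any starting frame; it need not be a silent frame
    for t, target in enumerate(targets):
        pred = step_fn(prev, t)
        preds.append(pred)
        prev = target if rng.random() < p_teacher else pred
    return preds


if __name__ == "__main__":
    # Toy "model" on scalar frames: predicts previous value plus one.
    toy_step = lambda prev, t: prev + 1
    rng = random.Random(0)
    for epoch in range(5):
        p = teacher_forcing_prob(epoch, 5)
        print(epoch, p, rollout(toy_step, [10, 20, 30], 0, p, rng))
```

Early in training the rollout is anchored to ground-truth frames, and as `p_teacher` decays the model is increasingly exposed to its own predictions, which is the usual motivation for scheduled sampling in autoregressive models.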