Speech-driven 3D face animation is challenging because human facial movements are intricate and highly variable. This paper argues that both the composite and the regional nature of facial movements must be considered. The composite nature refers to how speech-independent factors globally modulate speech-driven facial movements along the temporal dimension. The regional nature refers to the fact that facial movements are not globally correlated but are actuated by local musculature along the spatial dimension. Incorporating both is therefore essential for generating vivid animation. To address the composite nature, we introduce an adaptive modulation module that uses arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale. To accommodate the regional nature, our approach ensures that each component of the per-frame facial features focuses on the local spatial movements of the 3D face. In addition, we present a non-autoregressive backbone for translating audio into 3D facial movements, which preserves the high-frequency nuances of facial movements and enables efficient inference. Comprehensive experiments and user studies demonstrate that our method outperforms contemporary state-of-the-art approaches both qualitatively and quantitatively.
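To make the two ideas concrete, the following is a minimal toy sketch and not the architecture proposed in the paper. It assumes a FiLM-style scale-and-shift as a stand-in for the adaptive modulation module, and a fixed channel-to-vertex assignment as a stand-in for the regional constraint. The function names `global_modulate` and `regional_decode`, the `regions` mapping, and all numeric values are hypothetical and chosen only for illustration.

```python
def global_modulate(speech_feats, scale, shift):
    """Apply the same per-channel (scale, shift) to every frame.

    Assumption: a FiLM-style affine modulation stands in for the
    adaptive modulation module; the paper may use a different form.
    """
    return [[s * x + b for x, s, b in zip(frame, scale, shift)]
            for frame in speech_feats]


def regional_decode(frame_feats, regions, num_vertices):
    """Let feature channel k displace only the vertices in regions[k].

    Assumption: a hard, hand-written channel-to-region assignment;
    offsets are 1-D scalars per vertex to keep the example short.
    """
    offsets = [0.0] * num_vertices
    for value, vertex_ids in zip(frame_feats, regions):
        for v in vertex_ids:
            offsets[v] += value
    return offsets


# Toy data: 2 frames, 2 feature channels, 4 mesh vertices.
speech_feats = [[1.0, 2.0], [3.0, 4.0]]
scale, shift = [2.0, 0.5], [0.0, 1.0]  # hypothetical output of a style encoder
regions = [[0, 1], [2, 3]]             # hypothetical: channel 0 = mouth, channel 1 = brow

# Temporal axis: one global modulation shared by all frames.
modulated = global_modulate(speech_feats, scale, shift)
# Spatial axis: each channel only moves its own local region.
animation = [regional_decode(f, regions, 4) for f in modulated]
```

In this sketch the modulation parameters are shared across all frames, which mirrors the "global scale" along the temporal dimension, while each channel touches only its own vertex subset, which mirrors the locality along the spatial dimension. All frames are processed independently of one another, loosely in the spirit of a non-autoregressive design.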