Speech-driven 3D facial animation has been widely studied, yet a gap remains in achieving realism and vividness, owing to the highly ill-posed nature of the problem and the scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping as a regression task, which suffers from the regression-to-mean problem and therefore produces over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in the finite proxy space of a learned codebook, which promotes the vividness of the generated motions by reducing the uncertainty of the cross-modal mapping. The codebook is learned by self-reconstruction over real facial motions and is thus embedded with realistic facial motion priors. Over this discrete motion space, a temporal autoregressive model sequentially synthesizes facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. A user study further confirms the superiority of our approach in perceptual quality.
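To make the idea of a code query concrete, the following is a minimal sketch, not the method's actual implementation. It assumes the codebook is a small fixed array and that the query is a nearest-neighbor lookup under squared L2 distance; the function name `quantize`, the array shapes, and the toy values are all hypothetical. The learned encoder, the self-reconstruction training, and the temporal autoregressive model are not shown.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (T, D) array of per-frame vectors (hypothetical speech-derived features).
    codebook: (K, D) array of motion codes (here a toy, hand-written codebook).
    Returns the selected code indices (T,) and the selected codes (T, D).
    """
    # Pairwise squared L2 distances between every frame and every code: (T, K)
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Each frame selects the index of its closest code
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Toy codebook with K=3 codes of dimension D=2 (assumed values for illustration)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
# Toy per-frame features, T=3
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])

indices, quantized = quantize(features, codebook)
print(indices)    # index of the selected code for each frame
print(quantized)  # the selected codebook entries
```

Because every output frame is one of a finite set of codes, the output cannot drift toward an average of several plausible motions, which is the intuition behind querying a discrete space rather than regressing continuous values.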