4D Facial Expression Diffusion Model

Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The task has traditionally relied heavily on digital craftspersons and remains largely unexplored. In this paper, we introduce a generative framework for producing 3D facial expression sequences (i.e., 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It consists of two tasks: (1) learning a generative model trained over a set of 3D landmark sequences, and (2) generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks in other domains. Although it can be trained unconditionally, its reverse process can still be conditioned on various signals. This allows us to efficiently develop several downstream tasks involving conditional generation, using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder that applies the geometric deformation embedded in the landmarks to a given facial mesh. Experiments show that our model learns to generate realistic, high-quality expressions from a relatively small dataset alone, improving over state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.
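
To make the DDPM component concrete, the following is a minimal sketch of the standard DDPM forward (noising) process applied to a 3D landmark sequence. It is not the authors' implementation: the linear noise schedule, the 1000 diffusion steps, the sequence shape (40 frames, 68 landmarks), and the names `make_alpha_bar` and `q_sample` are all assumptions chosen for illustration.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form, as in the standard DDPM formulation.

    x0 is a clean landmark sequence of shape (frames, landmarks, 3).
    Returns the noised sequence and the Gaussian noise that was added.
    """
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise


rng = np.random.default_rng(0)
# Hypothetical stand-in for one training sample: 40 frames of 68 3D landmarks.
x0 = rng.standard_normal((40, 68, 3))
alpha_bar = make_alpha_bar()
xt, noise = q_sample(x0, 500, alpha_bar, rng)
```

In a typical DDPM setup, a network is trained to predict `noise` from `xt` and `t`, and the reverse process then denoises step by step from pure Gaussian noise. Conditioning of the kind described above (labels, text, partial sequences) would enter that reverse process, which this sketch does not cover.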
