In this work, we address the problem of 4D facial expression generation. This is usually done by animating a neutral 3D face to reach an expression peak and then return to the neutral state. In the real world, though, people show more complex expressions and switch from one expression to another. We thus propose a new model that generates transitions between different expressions and synthesizes long, composed 4D expressions. This involves three sub-problems: (i) modeling the temporal dynamics of expressions, (ii) learning transitions between them, and (iii) deforming a generic mesh. We propose to encode the temporal evolution of expressions as the motion of a set of 3D landmarks, which we learn to generate by training a manifold-valued GAN (Motion3DGAN). To allow the generation of composed expressions, this model accepts two labels encoding the starting and ending expressions. The final sequence of meshes is generated by a Sparse2Dense mesh Decoder (S2D-Dec) that maps the landmark displacements to dense, per-vertex displacements of a known mesh topology. Because it works explicitly with motion trajectories, the model is independent of identity. Extensive experiments on five public datasets show that our approach brings significant improvements over previous solutions while retaining good generalization to unseen data.
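To make the data flow concrete, the following is a minimal sketch of the pipeline's shapes only, not the authors' implementation. It assumes NumPy, and every name in it is hypothetical: `toy_landmark_motion` is a linear blend standing in for the learned Motion3DGAN generator, and `toy_sparse_to_dense` is a fixed weight matrix standing in for the learned S2D-Dec decoder. It illustrates how a pair of start and end labels yields a landmark displacement sequence, how transitions are chained into a composed expression, and how displacements are added to any neutral mesh.

```python
import numpy as np

def toy_landmark_motion(start_offset, end_offset, num_frames):
    """Hypothetical stand-in for the motion generator.

    Linearly blends two landmark offsets (each of shape (n_landmarks, 3))
    and returns displacements of shape (num_frames, n_landmarks, 3).
    """
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    return (1.0 - t) * start_offset[None] + t * end_offset[None]

def compose(labels, offsets, frames_per_transition):
    """Chain transitions between consecutive labels into one sequence."""
    segments = []
    for i, (a, b) in enumerate(zip(labels[:-1], labels[1:])):
        seg = toy_landmark_motion(offsets[a], offsets[b], frames_per_transition)
        # Drop the first frame of later segments: it repeats the previous end.
        segments.append(seg if i == 0 else seg[1:])
    return np.concatenate(segments, axis=0)

def toy_sparse_to_dense(landmark_disp, weights):
    """Hypothetical stand-in for the decoder.

    weights has shape (n_vertices, n_landmarks); the result is a dense
    per-vertex displacement of shape (n_vertices, 3).
    """
    return weights @ landmark_disp

def animate(neutral_vertices, weights, landmark_motion):
    """Add the dense displacement of every frame to a neutral mesh."""
    return np.stack(
        [neutral_vertices + toy_sparse_to_dense(d, weights) for d in landmark_motion]
    )

# Toy data (all values are made up for illustration).
rng = np.random.default_rng(0)
n_landmarks, n_vertices = 4, 10
weights = rng.random((n_vertices, n_landmarks))
weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1

offsets = {
    "neutral": np.zeros((n_landmarks, 3)),
    "happy": 0.01 * rng.standard_normal((n_landmarks, 3)),
    "surprise": 0.01 * rng.standard_normal((n_landmarks, 3)),
}

# Composed expression: neutral -> happy -> surprise.
motion = compose(["neutral", "happy", "surprise"], offsets, frames_per_transition=5)

# The same motion can drive two different identities (neutral meshes).
neutral_a = rng.random((n_vertices, 3))
neutral_b = rng.random((n_vertices, 3))
meshes_a = animate(neutral_a, weights, motion)
meshes_b = animate(neutral_b, weights, motion)
```

Because the motion is expressed as displacements rather than absolute positions, `meshes_a - neutral_a` and `meshes_b - neutral_b` coincide in this toy setup, which mirrors the identity independence described above.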