Audio-driven face animation is an eagerly anticipated technique for applications such as VR/AR, games, and movie making. With the rapid development of 3D engines, there is an increasing demand for driving 3D faces with audio. However, currently available 3D face animation datasets are either limited in scale or unsatisfactory in quality, which hampers further development of audio-driven 3D face animation. To address this challenge, we propose MMFace4D, a large-scale multi-modal 4D (3D sequence) face dataset consisting of 431 identities, 35,904 sequences, and 3.9 million frames. MMFace4D has two compelling characteristics: 1) it covers a highly diverse set of subjects and corpus, with actors ranging in age from 15 to 68 and recorded sentences lasting from 0.7 to 11.4 seconds; 2) it provides synchronized audio and 3D mesh sequences with high-resolution facial details. To capture the subtle nuances of 3D facial expressions, we use three synchronized RGBD cameras during recording. Building on MMFace4D, we construct a non-autoregressive framework for audio-driven 3D face animation. Our framework accounts for the regional and composite nature of facial animation, and surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively. The code, model, and dataset will be made publicly available.