Audio-driven face animation is an eagerly anticipated technique for applications such as VR/AR, games, and movie making. With the rapid development of 3D engines, there is an increasing demand for driving 3D faces with audio. However, currently available 3D face animation datasets are either limited in scale or unsatisfactory in quality, which hampers further development of audio-driven 3D face animation. To address this challenge, we propose MMFace4D, a large-scale multi-modal 4D (3D sequence) face dataset consisting of 431 identities, 35,904 sequences, and 3.9 million frames. MMFace4D has three appealing characteristics: 1) highly diversified subjects and corpus, 2) synchronized audio and 3D mesh sequences with high-resolution face details, and 3) low storage cost, achieved with a new efficient compression algorithm for 3D mesh sequences. These characteristics enable the training of high-fidelity, expressive, and generalizable face animation models. Building on MMFace4D, we construct a challenging benchmark for audio-driven 3D face animation with a strong baseline, which performs non-autoregressive generation with fast inference and outperforms the state-of-the-art autoregressive method. The whole benchmark will be released.