In this work, we present a multimodal solution to the problem of 4D face reconstruction from monocular videos. 3D face reconstruction from 2D images is an under-constrained problem due to depth ambiguity. State-of-the-art methods address this problem by leveraging visual information from a single image or video, whereas 3D mesh animation approaches rely mainly on audio. However, in most cases (e.g., AR/VR applications), videos include both visual and speech information. We propose AVFace, which incorporates both modalities and accurately reconstructs the 4D facial and lip motion of any speaker without requiring any 3D ground truth for training. A coarse stage estimates the per-frame parameters of a 3D morphable model, followed by a lip refinement step; a fine stage then recovers facial geometric details. Because transformer-based modules capture temporal audio and video information, our method is robust when either modality is insufficient (e.g., under face occlusions). Extensive qualitative and quantitative evaluation demonstrates the superiority of our method over the current state-of-the-art.
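To make the role of the per-frame 3D morphable model parameters concrete, the following is a minimal sketch of how a generic linear morphable model turns coefficients into mesh vertices. It is not the AVFace implementation: the abstract does not specify which morphable model is used, and the function names, the flat vertex layout, and the choice to share identity coefficients across frames while varying expression coefficients per frame are all assumptions made for illustration.

```python
def morphable_model_vertices(mean_shape, identity_basis, expression_basis,
                             identity_coeffs, expression_coeffs):
    """Evaluate a generic linear morphable model (hypothetical helper).

    mean_shape: flat list of 3*V floats (x, y, z per vertex).
    identity_basis, expression_basis: lists of basis vectors, each 3*V long.
    identity_coeffs, expression_coeffs: one weight per basis vector.
    Returns mean + sum_i a_i * B_id[i] + sum_j b_j * B_exp[j] as a flat list.
    """
    if len(identity_basis) != len(identity_coeffs):
        raise ValueError("one identity coefficient per identity basis vector")
    if len(expression_basis) != len(expression_coeffs):
        raise ValueError("one expression coefficient per expression basis vector")

    vertices = list(mean_shape)  # copy so the mean shape is not modified
    for basis, coeffs in ((identity_basis, identity_coeffs),
                          (expression_basis, expression_coeffs)):
        for vector, weight in zip(basis, coeffs):
            if len(vector) != len(vertices):
                raise ValueError("basis vector length must match mean shape")
            for i, value in enumerate(vector):
                vertices[i] += weight * value
    return vertices


def reconstruct_sequence(mean_shape, identity_basis, expression_basis,
                         identity_coeffs, per_frame_expression_coeffs):
    """Build one mesh per video frame (a "4D" sequence).

    Assumption for this sketch: identity is fixed for the speaker, and only
    the expression coefficients change from frame to frame.
    """
    return [
        morphable_model_vertices(mean_shape, identity_basis, expression_basis,
                                 identity_coeffs, frame_coeffs)
        for frame_coeffs in per_frame_expression_coeffs
    ]


if __name__ == "__main__":
    # Toy model with a single vertex, one identity and one expression basis.
    frames = reconstruct_sequence(
        mean_shape=[0.0, 0.0, 0.0],
        identity_basis=[[1.0, 0.0, 0.0]],
        expression_basis=[[0.0, 1.0, 0.0]],
        identity_coeffs=[2.0],
        per_frame_expression_coeffs=[[0.0], [0.5]],
    )
    for index, mesh in enumerate(frames):
        print(index, mesh)
```

In this toy setting the identity term moves the vertex along x for every frame, while the per-frame expression term moves it along y only in the second frame. In a pipeline like the one described, a coarse stage would predict such coefficients from the input, and later stages would refine the resulting geometry.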