Recent methods for audio-driven talking head synthesis often optimize neural radiance fields (NeRF) on a monocular talking portrait video, leveraging NeRF's capability to render high-fidelity, 3D-consistent novel-view frames. However, they often struggle to reconstruct complete face geometry because the input monocular videos lack comprehensive 3D information. In this paper, we introduce a novel audio-driven talking head synthesis framework, called Talk3D, that faithfully reconstructs plausible facial geometry by adopting a pre-trained 3D-aware generative prior. Given the personalized 3D generative model, we present a novel audio-guided attention U-Net architecture that predicts audio-driven dynamic face variations in the NeRF space. Our model is further modulated by audio-unrelated conditioning tokens, which effectively disentangle variations unrelated to audio features. Compared to existing methods, our method excels at generating realistic facial geometry even under extreme head poses. Extensive experiments show that our approach surpasses state-of-the-art methods in both quantitative and qualitative evaluations.