Talking head synthesis driven by arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity, identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It further incorporates an audio-driven dynamic lip point cloud as a key component of the conditional information, thereby enabling effective talking head synthesis. Specifically, we first generate the lip point cloud corresponding to the audio signal and capture its topological structure. We then design a dynamic difference encoder to more effectively capture the subtle nuances of dynamic lip movements. Furthermore, we integrate an audio-point enhancement module, which not only synchronizes the audio signal with the corresponding lip point cloud in the feature space, but also deepens the model's understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves superior fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.
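To make the conditioning pipeline concrete, below is a minimal PyTorch sketch of the three stages named above: audio-to-lip-point-cloud prediction, the dynamic difference encoder, and the audio-point enhancement module. Every class name, dimension, and layer choice here (`LipPointPredictor`, `DynamicDifferenceEncoder`, `AudioPointEnhancement`, the MLP and cross-attention designs) is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the PointTalk conditioning pipeline; the paper does
# not specify its architecture at this level of detail.

class LipPointPredictor(nn.Module):
    """Maps a per-frame audio feature to an (N, 3) lip point cloud (assumed design)."""
    def __init__(self, audio_dim=256, num_points=512):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, num_points * 3),
        )

    def forward(self, audio_feat):               # (B, audio_dim)
        pts = self.mlp(audio_feat)               # (B, N * 3)
        return pts.view(-1, self.num_points, 3)  # (B, N, 3)

class DynamicDifferenceEncoder(nn.Module):
    """Encodes frame-to-frame point offsets to capture subtle lip motion (assumed design)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(6, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, pts_t, pts_prev):          # both (B, N, 3)
        diff = pts_t - pts_prev                  # per-point motion offsets
        x = torch.cat([pts_t, diff], dim=-1)     # (B, N, 6): position + motion
        feat = self.point_mlp(x)                 # (B, N, feat_dim)
        return feat.max(dim=1).values            # global motion code, (B, feat_dim)

class AudioPointEnhancement(nn.Module):
    """Cross-attends audio and point features to keep the two modalities aligned (assumed design)."""
    def __init__(self, dim=128, audio_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, audio_feat, point_feat):        # (B, audio_dim), (B, dim)
        a = self.audio_proj(audio_feat).unsqueeze(1)  # (B, 1, dim)
        p = point_feat.unsqueeze(1)                   # (B, 1, dim)
        fused, _ = self.attn(query=p, key=a, value=a)
        return (p + fused).squeeze(1)                 # enhanced condition, (B, dim)

if __name__ == "__main__":
    B, audio_dim = 2, 256
    audio_t, audio_prev = torch.randn(B, audio_dim), torch.randn(B, audio_dim)
    predictor = LipPointPredictor(audio_dim)
    diff_enc = DynamicDifferenceEncoder()
    enhancer = AudioPointEnhancement()

    pts_t, pts_prev = predictor(audio_t), predictor(audio_prev)
    motion_code = diff_enc(pts_t, pts_prev)
    condition = enhancer(audio_t, motion_code)   # would drive the Gaussian deformation
    print(condition.shape)                       # torch.Size([2, 128])
```

In this sketch the fused feature would condition the deformation of the static 3D Gaussian field; how PointTalk actually injects the condition into the Gaussian deformation is not specified in the abstract.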