We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from the audio input. Due to the absence of high-quality datasets of talking humans in correspondence with audio, we capture a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real-time rendering rates, while encompassing diverse facial expressions and styles.