Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, precise control over fine-grained facial movements remains a significant challenge, particularly because of lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework for temporally consistent, high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy that explicitly models facial regions according to their motion characteristics: low-frequency regions, such as the cheeks, nose, and forehead, are modeled jointly by a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately by a dedicated network guided by facial-area masks. The predicted motion dynamics, represented as Gaussian deltas, are added to the static Gaussians to produce each head frame, which is rendered by a rasterizer using frame-specific camera parameters. In addition, a high-frequency-refined post-rendering alignment mechanism, learned by a model pretrained on large-scale audio-video pairs, enhances per-frame generation and achieves more accurate lip synchronization. Extensive experiments on widely used talking head generation datasets demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
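To make the frequency-aware disentanglement concrete, here is a minimal sketch of the delta-application step described above: a facial-area mask routes each Gaussian through either a shared low-frequency predictor (cheeks, nose, forehead) or a dedicated high-frequency predictor (eyes, mouth), and the predicted deltas are added to the static Gaussian centers before rasterization. All names, shapes, and the toy linear "networks" below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, AUDIO_DIM = 6, 4                       # toy number of Gaussians / audio-feature size
static_means = rng.normal(size=(N, 3))    # static Gaussian centers (x, y, z)
# Hypothetical facial-area mask: True = high-frequency region (eyes/mouth).
hf_mask = np.array([1, 1, 0, 0, 0, 1], dtype=bool)

# Two separate (random, untrained) linear maps standing in for the two networks.
W_low = rng.normal(scale=0.01, size=(AUDIO_DIM, 3))
W_high = rng.normal(scale=0.01, size=(AUDIO_DIM, 3))

def predict_deltas(audio_feat: np.ndarray) -> np.ndarray:
    """Predict per-Gaussian position deltas, routed by frequency band."""
    d_low = audio_feat @ W_low            # joint low-frequency motion
    d_high = audio_feat @ W_high          # dedicated high-frequency motion
    # Select the branch per Gaussian according to the facial-area mask.
    return np.where(hf_mask[:, None], d_high, d_low)

# One frame: deltas are added to the static Gaussians; the deformed Gaussians
# would then be passed to the rasterizer with that frame's camera parameters.
audio_feat = rng.normal(size=(AUDIO_DIM,))
deformed_means = static_means + predict_deltas(audio_feat)
```

In a full system the deltas would also cover rotation, scale, and opacity, and the predictors would be MLPs conditioned on audio features; this sketch only shows the mask-guided branch selection and the additive update to the static Gaussians.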