Talking head generation aims to synthesize natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specific heuristics, such as warping-based facial motion representation priors, to animate talking motions, yet they still produce inaccurate 3D avatar reconstructions, undermining the realism of the generated animations. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis. Our approach automatically learns to disentangle a single portrait image into a static 3D reconstruction, represented as static Gaussian splats, and a predicted whole-image 2D background. It then generates natural lip motion conditioned on the input audio, without any motion-driven priors. Training is driven purely by 2D reconstruction and score-distillation losses, without 3D supervision or landmarks. Experimental results demonstrate that Splat-Portrait achieves superior performance on talking head generation and novel view synthesis, with better visual quality than previous works. Our project code and supplementary documents are publicly available at https://github.com/stonewalking/Splat-portrait.