Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.