Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or on monocular videos with per-identity optimization at inference time, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method that generates high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and the Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representation with per-Gaussian features and introduce a lightweight MLP-based dynamic network that predicts 3D Gaussian deformations from expression codes. Furthermore, to enhance the geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
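The dynamic network described above can be illustrated with a minimal sketch: a small MLP that takes each Gaussian's learned feature concatenated with a shared expression code and outputs a per-Gaussian 3D deformation. All dimensions, layer counts, and the plain-NumPy formulation here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def mlp_deform(gaussian_feat, expr_code, weights):
    """Predict per-Gaussian 3D offsets from per-Gaussian features
    plus a shared expression code (hypothetical toy version).

    gaussian_feat: (N, F) array, one feature vector per Gaussian
    expr_code:     (E,) array, expression code shared by all Gaussians
    weights:       list of (W, b) pairs defining the MLP layers
    """
    n = gaussian_feat.shape[0]
    # Broadcast the shared expression code to every Gaussian and concatenate.
    expr = np.broadcast_to(expr_code, (n, expr_code.shape[0]))
    x = np.concatenate([gaussian_feat, expr], axis=1)
    # ReLU hidden layers, linear output head.
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = weights[-1]
    return x @ W + b  # (N, 3) deformation; real models may also predict rotation/scale offsets

# Toy sizes (assumptions): N Gaussians, F-dim features, E-dim expression, H hidden units.
rng = np.random.default_rng(0)
N, F, E, H = 4, 8, 6, 16
weights = [
    (rng.standard_normal((F + E, H)) * 0.1, np.zeros(H)),
    (rng.standard_normal((H, 3)) * 0.1, np.zeros(3)),
]
delta = mlp_deform(rng.standard_normal((N, F)), rng.standard_normal(E), weights)
print(delta.shape)  # (4, 3)
```

Because the network is a few small matrix multiplies per Gaussian, evaluating it each frame is cheap enough for real-time animation, which is the design motivation stated in the abstract.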