Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or on monocular videos with per-identity optimization at inference time, which limits their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose \OURS, a feed-forward method that generates high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and the Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representation with per-Gaussian features and introduce a lightweight MLP-based dynamic network that predicts 3D Gaussian deformations from expression codes. Furthermore, to enhance the geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
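The lightweight dynamic network described above can be pictured as a shared MLP that maps each Gaussian's feature vector, concatenated with the expression code, to a per-Gaussian deformation. The following is a minimal NumPy sketch of that idea; all dimensions, layer widths, the two-layer architecture, and the (position, rotation, scale) output split are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the paper).
N_GAUSSIANS = 1024   # number of 3D Gaussians
FEAT_DIM = 32        # per-Gaussian feature size
EXPR_DIM = 64        # expression-code size
HIDDEN = 128
OUT_DIM = 3 + 4 + 3  # position offset, rotation delta (quaternion), scale delta

def relu(x):
    return np.maximum(x, 0.0)

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(0.0, 0.02, (FEAT_DIM + EXPR_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.02, (HIDDEN, OUT_DIM))
b2 = np.zeros(OUT_DIM)

def predict_deformations(gaussian_feats, expr_code):
    """Broadcast the expression code to every Gaussian, concatenate it with
    the per-Gaussian features, and run the shared two-layer MLP to obtain
    per-Gaussian deformation parameters."""
    expr = np.broadcast_to(expr_code, (gaussian_feats.shape[0], EXPR_DIM))
    x = np.concatenate([gaussian_feats, expr], axis=1)
    out = relu(x @ W1 + b1) @ W2 + b2
    d_pos, d_rot, d_scale = out[:, :3], out[:, 3:7], out[:, 7:]
    return d_pos, d_rot, d_scale

feats = rng.normal(size=(N_GAUSSIANS, FEAT_DIM))
expr = rng.normal(size=(EXPR_DIM,))
d_pos, d_rot, d_scale = predict_deformations(feats, expr)
print(d_pos.shape, d_rot.shape, d_scale.shape)
```

Because the MLP is shared across Gaussians and is only a few matrix multiplies per frame, evaluating it for every expression code is cheap, which is what makes real-time animation of the explicit Gaussian representation feasible.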