We introduce ELITE, a method for Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely on either a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D-data-prior methods often struggle to generalize in the wild, while 2D-generative-prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage that leverages both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies, which are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars compared to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.