We present SynShot, a novel method for few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle three major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, the use of real data is strictly regulated (e.g., under the General Data Protection Regulation, which mandates deletion of models and data whenever a participant withdraws consent). Synthetic data, free from these constraints, is an appealing alternative. Third, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. Given only a few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs. hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to state-of-the-art monocular and GAN-based methods, SynShot significantly improves novel view and expression synthesis.
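To make the architectural description concrete, the following is a minimal sketch (not the authors' code) of a convolutional decoder that maps a latent identity/expression code to 3D Gaussian parameters laid out in UV texture space, in the spirit of the encoder-decoder described above. The class name, channel widths, UV resolution, and the 14-channel parameter split are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch only: a conv decoder producing per-texel 3D Gaussian
# parameters in UV texture space. All names and sizes are assumptions.
import torch
import torch.nn as nn

class UVGaussianDecoder(nn.Module):
    def __init__(self, latent_dim=256, uv_res=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 512 * 4 * 4)
        blocks, ch, res = [], 512, 4
        # Upsample from 4x4 to the target UV resolution.
        while res < uv_res:
            blocks += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch //= 2
            res *= 2
        # Per-texel Gaussian parameters: 3 position offsets, 3 log-scales,
        # 4 rotation (quaternion), 1 opacity, 3 color = 14 channels (assumed split).
        blocks += [nn.Conv2d(ch, 14, 3, padding=1)]
        self.net = nn.Sequential(*blocks)

    def forward(self, z):
        x = self.fc(z).view(-1, 512, 4, 4)
        maps = self.net(x)                      # (B, 14, uv_res, uv_res)
        pos, log_scale, rot, opacity, color = torch.split(
            maps, [3, 3, 4, 1, 3], dim=1)
        return {
            "position": pos,
            "scale": log_scale.exp(),
            "rotation": nn.functional.normalize(rot, dim=1),
            "opacity": torch.sigmoid(opacity),
            "color": torch.sigmoid(color),
        }

# Usage: decode a random latent into a 256x256 UV grid of Gaussian parameters.
params = UVGaussianDecoder()(torch.randn(1, 256))
print({k: v.shape for k, v in params.items()})
```

In such a layout, each UV texel corresponds to one Gaussian primitive anchored on the head surface, which is what makes a per-part upsampling control (e.g., a denser UV grid for hair than for skin) straightforward to express.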