A photorealistic and immersive human avatar experience demands capturing fine, person-specific details such as cloth and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which often introduce ambiguities and spurious correlations when very similar poses correspond to different appearances. Models that fit these details during training can overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone that is conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and helps disambiguate pose-driven renderings. At driving time, our predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method delivers a robust and practical path to high-fidelity, stable avatar driving.
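The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify layer sizes, the form of the predictor, or the output attributes, so the dimensions, the residual latent update, the randomly initialized weights (stand-ins for trained parameters), and the names `rollout_latents` and `decode_gaussians` are all hypothetical. The training-time encoder is omitted; only the driving-time path is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not state any of them.
POSE_DIM, LATENT_DIM, HIDDEN = 6, 4, 16


def init_mlp(in_dim, hidden, out_dim):
    """Random weights for a two-layer MLP (stand-in for trained parameters)."""
    return {
        "w1": rng.normal(0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.1, (hidden, out_dim)), "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Apply the two-layer MLP to a vector or a batch of row vectors."""
    h = np.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


# Predictor: (previous latent, current pose) -> latent update.
predictor = init_mlp(LATENT_DIM + POSE_DIM, HIDDEN, LATENT_DIM)
# Decoder: (Gaussian position, pose, latent) -> 4 per-Gaussian values
# (assumed here to be RGB + opacity, purely for illustration).
decoder = init_mlp(3 + POSE_DIM + LATENT_DIM, HIDDEN, 4)


def rollout_latents(poses, z0):
    """Autoregressively infer one appearance latent per frame."""
    z, latents = z0, []
    for pose in poses:
        # Residual update is an assumption; it keeps consecutive latents close.
        z = z + mlp(predictor, np.concatenate([z, pose]))
        latents.append(z)
    return np.stack(latents)


def decode_gaussians(positions, pose, z):
    """Condition every Gaussian on the same pose and appearance latent."""
    n = positions.shape[0]
    cond = np.broadcast_to(np.concatenate([pose, z]), (n, POSE_DIM + LATENT_DIM))
    return mlp(decoder, np.concatenate([positions, cond], axis=1))


poses = rng.normal(size=(5, POSE_DIM))      # a short driving sequence
positions = rng.normal(size=(100, 3))       # canonical Gaussian centers
latents = rollout_latents(poses, np.zeros(LATENT_DIM))
attrs = decode_gaussians(positions, poses[-1], latents[-1])
```

Because each latent depends on the previous one rather than on the current pose alone, two similar poses reached through different motion histories can map to different appearances, which is the disambiguation the abstract refers to.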