Latent Dynamics for Full Body Avatar Animation

Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.
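
The abstract gives no equations for the force decomposition, so the following is only a minimal sketch of how an update split into driving, restoring, and dissipative terms could look. It assumes a damped, driven spring integrated with semi-implicit Euler. The names `step_latent`, `toy_drive`, and `rollout`, the explicit latent velocity, the finite-difference driving term, and all constants are hypothetical stand-ins for illustration, not the learned model described above.

```python
from typing import List, Sequence

Vector = List[float]


def step_latent(z: Vector, v: Vector, drive: Vector,
                stiffness: float = 4.0, damping: float = 1.0,
                dt: float = 1.0 / 30.0):
    """One semi-implicit Euler step of a damped, driven latent (assumed form).

    force = drive - stiffness * z - damping * v
            (driving) (restoring)   (dissipative)
    """
    new_v = [vi + dt * (fi - stiffness * zi - damping * vi)
             for zi, vi, fi in zip(z, v, drive)]
    # Position is advanced with the updated velocity (semi-implicit Euler).
    new_z = [zi + dt * nvi for zi, nvi in zip(z, new_v)]
    return new_z, new_v


def toy_drive(pose_history: Sequence[Vector]) -> Vector:
    """Hypothetical driving term: finite-difference pose velocity.

    A learned network over a short pose history would take this role.
    """
    prev, curr = pose_history[-2], pose_history[-1]
    return [c - p for p, c in zip(prev, curr)]


def rollout(poses: Sequence[Vector], z0: Vector,
            stiffness: float = 4.0, damping: float = 1.0,
            dt: float = 1.0 / 30.0) -> List[Vector]:
    """Roll the latent forward from an initial state z0 along a pose sequence."""
    z, v = list(z0), [0.0] * len(z0)
    trajectory = [list(z)]
    for t in range(1, len(poses)):
        drive = toy_drive(poses[t - 1:t + 1])
        z, v = step_latent(z, v, drive, stiffness, damping, dt)
        trajectory.append(z)
    return trajectory
```

In this toy setting, a static pose sequence produces no driving force, so a nonzero initial latent oscillates and decays toward rest, and two different values of `z0` trace different trajectories under the same poses. Raising `stiffness` strengthens the restoring term, which is one way a control such as stiffness could be exposed.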
