Estimating 3D full-body avatars from AR/VR devices is essential for creating immersive AR/VR experiences. The task is challenging because head-mounted devices (HMDs) capture only sparse observations of the head and hands, which makes predicting the full body, and the lower body in particular, difficult. In this paper, we draw on an inherent property of the kinematic tree defined in the Skinned Multi-Person Linear (SMPL) model: the upper body and the lower body share only one common ancestor node, which makes decoupled reconstruction possible. We propose a stratified approach that splits the conventional full-body avatar reconstruction pipeline into two stages: the upper body is reconstructed first, and the lower body is then reconstructed conditioned on the output of the first stage. To implement this idea, we use a latent diffusion model as a powerful probabilistic generator and train it to follow the latent distribution of the decoupled motions learned by a VQ-VAE encoder-decoder model. Extensive experiments on the AMASS mocap dataset demonstrate state-of-the-art performance in full-body motion reconstruction.
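The kinematic-tree property that motivates the decoupling can be checked directly. The sketch below uses the standard 24-joint SMPL parent table and verifies that every lower-body joint and every upper-body joint have the pelvis (joint 0) as their only common ancestor. The partition into `lower_body` and `upper_body` is an assumption made for illustration; the abstract does not state the exact joint grouping the method uses.

```python
# Standard SMPL parent indices for the 24 joints; joint 0 (pelvis) is the root.
# 1/2 = left/right hip, 3 = spine1; the legs hang off the hips, the torso,
# head, and arms hang off spine1.
SMPL_PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]


def ancestors(joint):
    """Return the set of all ancestors of `joint`, walking up to the root."""
    result = set()
    parent = SMPL_PARENTS[joint]
    while parent != -1:
        result.add(parent)
        parent = SMPL_PARENTS[parent]
    return result


def subtree(root):
    """Return `root` together with every joint that descends from it."""
    return {j for j in range(len(SMPL_PARENTS))
            if j == root or root in ancestors(j)}


# Assumed partition (illustrative only): both leg chains vs. everything
# below spine1. The pelvis itself belongs to neither group.
lower_body = subtree(1) | subtree(2)
upper_body = subtree(3)

# Collect the common-ancestor set for every (lower, upper) joint pair.
common = {frozenset(ancestors(l) & ancestors(u))
          for l in lower_body for u in upper_body}

print(sorted(lower_body))
print(sorted(upper_body))
print(common)
```

Because the two groups meet only at the root, a model for the lower body needs no joint from the upper-body chain as a kinematic parent, which is what makes generating the upper body first and conditioning the lower body on it a natural factorization.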