Tracking the full-body motions of users of XR (AR/VR) devices is a fundamental challenge for bringing a sense of authentic social presence. Due to the absence of dedicated leg sensors, currently available body tracking methods adopt a synthesis approach to generate plausible motions given only a 3-point signal from the head and controller tracking. To enable mixed reality features, modern XR devices can estimate depth information of the headset surroundings using available sensors combined with dedicated machine learning models. Such egocentric depth sensing cannot drive the body directly, as it is not registered and is incomplete due to limited field of view and body self-occlusions. For the first time, we propose to leverage the available depth sensing signal combined with self-supervision to learn a multi-modal pose estimation model capable of tracking full-body motions in real time on XR devices. We demonstrate how current 3-point motion synthesis models can be extended to point cloud modalities using a semantic point cloud encoder network combined with a residual network for multi-modal pose estimation. These modules are trained jointly in a self-supervised way, leveraging a combination of real unregistered point clouds and simulated data obtained from motion capture. We compare our approach against several state-of-the-art systems for XR body tracking and show that our method accurately tracks a diverse range of body motions. XR-MBT tracks legs in XR for the first time, whereas traditional synthesis approaches based on partial body tracking are blind to leg motion.
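To make the described architecture concrete, the following is a minimal, hypothetical sketch of the fusion idea: a PointNet-style semantic point cloud encoder (shared per-point MLP plus order-invariant max pooling) whose global feature is combined with a 3-point synthesis pose by a residual head. All layer sizes, names, and the 22-joint skeleton are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    # Apply a stack of dense layers with ReLU between them.
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def init(sizes):
    # Random weights for illustration only (no training here).
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

N_JOINTS = 22            # assumed SMPL-like skeleton
POSE_DIM = N_JOINTS * 3  # per-joint 3D rotation parameters (assumption)

# PointNet-style encoder: shared per-point MLP, then max pooling.
point_mlp = init([3, 64, 128])
# Residual head: fuses the cloud feature with the 3-point pose estimate.
residual_head = init([128 + POSE_DIM, 256, POSE_DIM])

def encode_point_cloud(points):
    """points: (P, 3) unregistered depth points -> (128,) global feature."""
    per_point = mlp(points, point_mlp)   # (P, 128)
    return per_point.max(axis=0)         # order-invariant pooling

def multimodal_pose(base_pose, points):
    """Refine a 3-point synthesis pose with a depth-driven residual."""
    feat = encode_point_cloud(points)
    delta = mlp(np.concatenate([feat, base_pose]), residual_head)
    return base_pose + delta             # residual connection

base_pose = rng.standard_normal(POSE_DIM)  # stand-in for a 3-point model
cloud = rng.standard_normal((512, 3))      # stand-in egocentric point cloud
pose = multimodal_pose(base_pose, cloud)
print(pose.shape)  # (66,)
```

The max pooling makes the encoder invariant to point ordering and tolerant of varying cloud sizes, which matters because the egocentric depth signal is incomplete and unregistered; the residual connection lets the depth branch correct, rather than replace, the 3-point synthesis output.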