We present Better Together, a method that solves the human pose estimation problem while simultaneously reconstructing a photorealistic 3D human avatar from multi-view videos. While prior work typically treats these problems separately, we argue that jointly optimizing skeletal motion with a renderable 3D body model has synergistic effects: it yields more precise motion capture and improves the visual quality of real-time avatar rendering. To this end, we introduce a novel animatable avatar with 3D Gaussians rigged to a personalized mesh and propose to optimize the motion sequence with time-dependent MLPs that provide accurate and temporally consistent pose estimates. We first evaluate our method on highly challenging yoga poses and demonstrate state-of-the-art accuracy in multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared to keypoint-based methods. At the same time, our method significantly improves the visual quality of animatable avatars (+2 dB PSNR on novel view synthesis) across diverse challenging subjects.
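To make the time-dependent MLP idea concrete, the sketch below shows a minimal version of such a parameterization: a small network that maps a normalized timestamp to per-joint pose parameters, so that nearby frames share network weights and features and the recovered motion is temporally consistent. This is an illustrative sketch, not the paper's implementation; the skeleton size, layer widths, and the Fourier time encoding are all assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): a time-conditioned
# MLP mapping a normalized timestamp t in [0, 1] to per-joint pose
# parameters. Sizes and the Fourier encoding are assumptions.

N_JOINTS = 24            # hypothetical skeleton size (SMPL-like body)
POSE_DIM = N_JOINTS * 3  # axis-angle rotation per joint
N_FREQS = 6              # Fourier features for the time input
HIDDEN = 128

rng = np.random.default_rng(0)

def fourier_encode(t, n_freqs=N_FREQS):
    """Encode scalar time t with sin/cos features so the MLP can
    represent smooth but high-frequency motion over the sequence."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

# Two-layer MLP weights; in the joint-optimization setting these would
# be fit together with the avatar via a photometric rendering loss.
W1 = rng.standard_normal((HIDDEN, 2 * N_FREQS)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((POSE_DIM, HIDDEN)) * 0.1
b2 = np.zeros(POSE_DIM)

def pose_at(t):
    """Pose for timestamp t: since t is the only input, nearby
    timestamps produce nearby poses (temporal consistency)."""
    h = np.maximum(W1 @ fourier_encode(t) + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

# Query the motion sequence at five evenly spaced timestamps.
poses = np.stack([pose_at(t) for t in np.linspace(0.0, 1.0, 5)])
print(poses.shape)  # (5, 72): five timestamps, 72 pose parameters each
```

Because the pose is a continuous function of time rather than an independent estimate per frame, gradients from a rendering loss at one frame also smooth the poses of its neighbors, which is one way to read the claimed synergy between motion capture and avatar reconstruction.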