We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods that lack robust vision priors, such as 4DGS, struggle to capture highly deformable human motion without multi-view cues. Template-based approaches that rely primarily on SMPL, such as HUGS, can produce photorealistic results, but they are highly susceptible to errors in human pose estimation, which often lead to unrealistic artifacts. In contrast, ShapeGaussian integrates template-free vision priors to achieve both high fidelity and robustness in scene reconstruction. Our method follows a two-step pipeline: we first learn a coarse, deformable geometry using pretrained models that provide data-driven priors, establishing a foundation for reconstruction; we then refine this geometry with a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate the artifacts that erroneous pose estimation introduces in template-based methods, and we employ multiple reference frames to resolve the invisibility of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.