Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by exploiting the temporal continuity of humans inherent in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, and then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness on challenging in-the-wild videos, without any retraining. Our code and demo are available at https://github.com/gaomingqi/sam-body4d.
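As a concrete illustration of the padding-based parallel strategy mentioned above, the minimal sketch below pads each frame's variable-size set of human masklets to a fixed slot count so that all frames can be stacked and processed as a single batch. The function name `pad_and_batch`, the mask shapes, and the `max_humans` bound are illustrative assumptions, not the actual SAM-Body4D API.

```python
import numpy as np

def pad_and_batch(masklets, max_humans):
    """masklets: per-frame list of [n_i, H, W] boolean mask arrays."""
    num_frames = len(masklets)
    h, w = masklets[0].shape[1:]
    batch = np.zeros((num_frames, max_humans, h, w), dtype=bool)
    valid = np.zeros((num_frames, max_humans), dtype=bool)
    for t, masks in enumerate(masklets):
        n = min(len(masks), max_humans)
        batch[t, :n] = masks[:n]   # copy real masklets into fixed slots
        valid[t, :n] = True        # flag real (non-padded) human slots
    return batch, valid

# Toy usage: three frames tracking 2, 1, and 3 people respectively.
rng = np.random.default_rng(0)
masklets = [rng.random((n, 4, 4)) > 0.5 for n in (2, 1, 3)]
batch, valid = pad_and_batch(masklets, max_humans=3)
print(batch.shape)        # (3, 3, 4, 4)
print(valid.sum(axis=1))  # [2 1 3] real humans per frame
```

With such fixed-size batches, every frame's human crops can go through the mesh-recovery model in one forward pass, and the `valid` flags let the padded slots be discarded afterwards.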