Most advances in human mesh recovery (HMR) have focused on pelvis-centered recovery, overlooking metric 3D localization and detection accuracy in the camera coordinate system, two key factors for real-world applications such as human-robot interaction and social scene understanding. Current evaluation protocols often ignore these aspects, emphasizing per-person, root-centered recovery rather than camera-space perception. As a result, existing approaches rely on fixed camera assumptions or handcrafted post-processing, which limits their robustness and practical deployment. We introduce Multi-HMR 2, a simple yet robust DETR-based framework for Multi-person Camera-centric Human detection, mesh Recovery, and tracking. Multi-HMR 2 predicts a scene-consistent camera together with human meshes, enabling metric 3D localization without ground-truth intrinsics. Moreover, by distilling image-based memory features from SAM2, Multi-HMR 2 extends to tracking, achieving consistent identity association without video supervision. Despite its conceptual simplicity (no handcrafted components, no video input, and no ground-truth cameras), Multi-HMR 2 achieves state-of-the-art pelvis-centered performance while substantially improving detection accuracy and metric 3D localization.