Scene-Aware 3D Multi-Human Motion Capture from a Single Camera

Source: arXiv. Accepted to Eurographics 2023. See also GitHub and project page: https://github.com/dluvizon/scene-aware-3d-multi-human

In this work, we consider the problem of estimating the 3D position of multiple humans in a scene, as well as their body shape and articulation, from a single RGB video recorded with a static camera. In contrast to expensive marker-based or multi-view systems, our lightweight setup is ideal for private users, as it enables affordable 3D motion capture that is easy to install and does not require expert knowledge. To deal with this challenging setting, we leverage recent advances in computer vision, using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks. Building on these, we introduce the first non-linear optimization-based approach that jointly solves for the absolute 3D position of each human, their articulated pose, their individual shapes, and the scale of the scene. In particular, we estimate the scene depth and a unique scale for each person from normalized disparity predictions, using the 2D body joints and joint angles. Given the per-frame scene depth, we reconstruct a point cloud of the static scene in 3D space. Finally, given the per-frame 3D estimates of the humans and the scene point cloud, we perform a space-time coherent optimization over the video to ensure temporal, spatial, and physical plausibility. We evaluate our method on established multi-person 3D human pose benchmarks, where we consistently outperform previous methods. We also demonstrate qualitatively that our method is robust to in-the-wild conditions, including challenging scenes with people of different sizes.
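
To make the depth and point-cloud steps more concrete, here is a minimal sketch in plain Python. It is not the paper's method, which solves a joint non-linear optimization. It only illustrates one common simplification: assuming the normalized disparity d relates to metric depth z through an unknown scale s and shift t (1/z ≈ s·d + t), fitting s and t by linear least squares at a few pixels with known depth (for example, body joints), and back-projecting pixels with a pinhole camera. All function names, the affine disparity model, and the camera intrinsics are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fit_disparity_scale_shift(
    disparity: Sequence[float], depth: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of (s, t) such that s * disparity + t ~= 1 / depth.

    Assumption: the disparity map is correct up to an unknown scale and shift.
    `depth` holds reference metric depths (e.g. at body joints), all > 0.
    """
    if len(disparity) != len(depth) or len(disparity) < 2:
        raise ValueError("need at least two matching samples")
    inv_depth = [1.0 / z for z in depth]
    n = len(disparity)
    mean_d = sum(disparity) / n
    mean_i = sum(inv_depth) / n
    var_d = sum((d - mean_d) ** 2 for d in disparity)
    if var_d == 0.0:
        raise ValueError("disparity samples must not all be equal")
    cov = sum((d - mean_d) * (i - mean_i) for d, i in zip(disparity, inv_depth))
    s = cov / var_d            # slope of the 1D linear regression
    t = mean_i - s * mean_d    # intercept
    return s, t


def disparity_to_depth(d: float, s: float, t: float) -> float:
    """Convert one normalized disparity value to metric depth."""
    return 1.0 / (s * d + t)


def backproject(
    u: float, v: float, z: float, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with depth z to a 3D point in camera coordinates
    using a pinhole model with focal lengths (fx, fy) and principal point (cx, cy)."""
    return ((u - cx) / fx * z, (v - cy) / fy * z, z)


if __name__ == "__main__":
    # Hypothetical samples: disparities at three joints and their depths in meters.
    s, t = fit_disparity_scale_shift([0.8, 0.3, 0.2], [2.0, 4.0, 5.0])
    z = disparity_to_depth(0.3, s, t)
    print(backproject(420.0, 240.0, z, 500.0, 500.0, 320.0, 240.0))
```

Applying `disparity_to_depth` and `backproject` to every static-scene pixel of a frame would give a per-frame point cloud. In a full pipeline, the scale and shift would be refined together with the human pose and shape parameters rather than fitted in isolation.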
