We present EgoHumans, a new multi-view multi-human video benchmark to advance the state of the art in egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture a single subject or are restricted to indoor scenarios, which limits the generalization of computer vision algorithms to real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild, with annotations that support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We use consumer-grade wearable camera-equipped glasses for the egocentric view, which lets us capture dynamic activities such as soccer, fencing, and volleyball. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images spanning diverse scenes, with a particular focus on challenging, unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address these limitations, we propose EgoFormer, a novel approach that uses a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 and 9.3 HOTA on the EgoHumans dataset.