Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables training on large unlabeled datasets. Our model is fully differentiable; it introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and it supports both keypoint and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training: it follows a teacher-student scheme to generate pseudo-labels and guides training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with 0.8 ms GPU latency, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. Training with our auto-labeling system reduces wrist MPJPE by a further 13.1%.
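To make the architectural terms concrete, the sketch below illustrates two of the listed components in PyTorch: identity-conditioned queries (learned per-joint queries shifted by an embedding of the wearer's identity) and causal temporal attention (each frame attends only to past frames, keeping inference streaming-friendly). All class and parameter names here are hypothetical; this is a minimal sketch of the general mechanisms, not the paper's implementation.

```python
import torch
import torch.nn as nn

class IdentityConditionedQueries(nn.Module):
    """Learned per-joint queries, shifted by an embedding of the wearer's identity.
    (Hypothetical illustration of 'identity-conditioned queries'.)"""
    def __init__(self, num_joints: int, dim: int, num_identities: int):
        super().__init__()
        self.joint_queries = nn.Parameter(torch.randn(num_joints, dim) * 0.02)
        self.identity_embed = nn.Embedding(num_identities, dim)

    def forward(self, identity_ids: torch.Tensor) -> torch.Tensor:
        # identity_ids: (B,) -> queries: (B, num_joints, dim)
        ident = self.identity_embed(identity_ids).unsqueeze(1)  # (B, 1, dim)
        return self.joint_queries.unsqueeze(0) + ident          # broadcast add

class CausalTemporalAttention(nn.Module):
    """Self-attention over the time axis with a causal mask, so frame t
    attends only to frames <= t."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); True entries in the mask block attention to future frames.
        T = x.shape[1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

# Toy usage: 2 sequences, 8 frames, 17 joints, 64-dim features.
queries = IdentityConditionedQueries(num_joints=17, dim=64, num_identities=10)
temporal = CausalTemporalAttention(dim=64)
q = queries(torch.tensor([0, 3]))      # (2, 17, 64)
feats = torch.randn(2, 8, 64)          # per-frame pooled features
print(q.shape, temporal(feats).shape)  # (2, 17, 64) (2, 8, 64)
```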
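Likewise, the uncertainty-aware semi-supervised training can be sketched as a teacher-student loss in which pseudo-labels are downweighted by the teacher's per-joint uncertainty and the student distills that uncertainty. The Gaussian-style exp(-logvar) weighting and the smooth-L1 distillation term below are assumptions chosen for illustration; the abstract does not specify the exact loss.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(student_pose, student_logvar, teacher_pose, teacher_logvar):
    """student_pose/teacher_pose: (B, J, 3); *_logvar: (B, J) per-joint log-variance.
    A hypothetical instance of uncertainty-weighted pseudo-labeling with
    uncertainty distillation, not necessarily the paper's formulation."""
    # Per-joint squared error between student prediction and teacher pseudo-label.
    sq_err = ((student_pose - teacher_pose) ** 2).sum(dim=-1)  # (B, J)
    # Downweight joints the teacher is unsure about (higher logvar -> lower weight).
    weight = torch.exp(-teacher_logvar.detach())
    pose_term = (weight * sq_err).mean()
    # Uncertainty distillation: the student learns to mimic the teacher's confidence.
    distill_term = F.smooth_l1_loss(student_logvar, teacher_logvar.detach())
    return pose_term + distill_term

# Toy usage with random tensors standing in for network outputs.
B, J = 4, 17
loss = pseudo_label_loss(torch.randn(B, J, 3), torch.randn(B, J),
                         torch.randn(B, J, 3), torch.randn(B, J))
print(loss.item())
```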