While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework for egocentric stereo 3D human pose estimation that leverages scene information and the temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method accurately estimates human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
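To make the joint-query idea above concrete, the following PyTorch snippet is a minimal sketch of a transformer decoder in which learnable per-joint queries, enhanced by temporal context from a window of video frames, cross-attend to scene/depth feature tokens. All module names, dimensions, and the fusion strategy (e.g., a GRU for temporal encoding, additive query enhancement) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a joint-query transformer decoder over scene features.
# NOT the paper's implementation; architecture choices here are assumptions.
import torch
import torch.nn as nn

class JointQueryDecoder(nn.Module):
    """Learnable per-joint queries cross-attend to fused scene/depth tokens."""
    def __init__(self, num_joints=16, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        self.joint_queries = nn.Parameter(torch.randn(num_joints, dim))
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Temporal encoder over per-frame features (assumption: a single GRU).
        self.temporal = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        self.head = nn.Linear(dim, 3)  # regress (x, y, z) per joint

    def forward(self, scene_feats, frame_feats):
        # scene_feats: (B, N, dim) tokens from a depth/scene reconstruction branch
        # frame_feats: (B, T, dim) per-frame features from the stereo video window
        B = scene_feats.size(0)
        _, h = self.temporal(frame_feats)             # h: (1, B, dim) temporal context
        queries = self.joint_queries.unsqueeze(0).expand(B, -1, -1)
        queries = queries + h.transpose(0, 1)         # enhance joint queries temporally
        decoded = self.decoder(queries, scene_feats)  # cross-attention to scene tokens
        return self.head(decoded)                     # (B, num_joints, 3) 3D pose

# Usage with dummy tensors:
model = JointQueryDecoder()
pose = model(torch.randn(2, 196, 256), torch.randn(2, 5, 256))
print(pose.shape)  # torch.Size([2, 16, 3])
```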