The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features of a dynamic hand with those of a baseline relaxed pose. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers ≥50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only improves the reliability of downstream tasks such as index-finger pinch and tap estimation under occlusion, but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" with no visible motion, all while keeping the model compact.
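The core idea of the dual-stream delta encoder — contrasting dense features of the moving hand against a relaxed-pose baseline — can be sketched as follows. This is a minimal illustration under stated assumptions (patch-grid shapes, mean pooling, and the toy MLP head are ours, not the paper's implementation):

```python
import numpy as np

def delta_stream(f_dynamic, f_baseline):
    """Concatenate absolute features with their change relative to the
    relaxed baseline, giving the decoder both streams (assumed design)."""
    delta = f_dynamic - f_baseline            # per-patch skin-deformation signal
    return np.concatenate([f_dynamic, delta], axis=-1)

def toy_pose_head(x, w1, b1, w2, b2):
    """Stand-in 2-layer MLP mapping pooled features to joint angles."""
    h = np.maximum(0.0, x @ w1 + b1)          # ReLU hidden layer
    return h @ w2 + b2                        # predicted joint angles (radians)

rng = np.random.default_rng(0)
n_patches, feat_dim, n_angles = 196, 64, 20   # e.g. a 14x14 ViT patch grid

f_base = rng.normal(size=(n_patches, feat_dim))  # relaxed-hand dense features
f_dyn  = rng.normal(size=(n_patches, feat_dim))  # current-frame dense features

streams = delta_stream(f_dyn, f_base)            # (196, 128): both streams
pooled  = streams.mean(axis=0)                   # global average pool

w1 = rng.normal(size=(2 * feat_dim, 32)); b1 = np.zeros(32)
w2 = rng.normal(size=(32, n_angles));     b2 = np.zeros(n_angles)
angles = toy_pose_head(pooled, w1, b1, w2, b2)   # one angle per tracked joint
```

In a trained system the two feature maps would come from a frozen dense visual featurizer applied to the cropped dorsal image, with the baseline captured once during a brief relaxed-hand calibration; here random features stand in to show the data flow only.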