Estimating 3D hand pose from monocular RGB images is fundamental to applications in AR/VR, human-computer interaction, and sign language understanding. In this work, we focus on a scenario in which a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework. The first stage is gesture-aware pretraining, which learns an informative embedding space from the coarse and fine gesture labels of InterHand2.6M. The second stage is a per-joint token Transformer that uses the gesture embeddings as intermediate representations to guide the final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
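To make the idea of a layered objective concrete, the following is a minimal sketch of a weighted sum over a parameter term, a joint term, and a structural term. It is not the formulation used in this work: the function names, the loss weights, the 5-joint toy skeleton in `PARENTS`, and the use of bone-length consistency as the structural constraint are all assumptions made for illustration.

```python
import math

# Hypothetical parent index per joint for a toy 5-joint chain (wrist plus one
# finger). The MANO skeleton has 21 joints; this small tree is illustrative only.
PARENTS = [-1, 0, 1, 2, 3]


def l1(a, b):
    """Mean absolute difference between two flat sequences of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def joint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 3D joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def bone_lengths(joints, parents=PARENTS):
    """Length of each bone, taken as the distance from a joint to its parent."""
    return [math.dist(joints[c], joints[p])
            for c, p in enumerate(parents) if p >= 0]


def layered_loss(pred_params, gt_params, pred_joints, gt_joints,
                 w_param=1.0, w_joint=1.0, w_struct=0.1):
    """Weighted sum of parameter, joint, and structural (bone-length) terms.

    The weights and the bone-length structural term are assumptions,
    not values or definitions taken from the paper.
    """
    terms = {
        "param": l1(pred_params, gt_params),
        "joint": joint_error(pred_joints, gt_joints),
        "struct": l1(bone_lengths(pred_joints), bone_lengths(gt_joints)),
    }
    total = (w_param * terms["param"]
             + w_joint * terms["joint"]
             + w_struct * terms["struct"])
    return total, terms
```

In this sketch, a prediction that is rigidly translated away from the ground truth incurs a joint penalty but no structural penalty, because bone lengths are unchanged by translation. This illustrates why the layers of such an objective can carry complementary information.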