We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain, but they struggle to generalise to new environments due to limited training data and poor depth perception, overfitting to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference: a 3D consistency loss between predicted and in-space scale-transformed hand poses allows the model to adapt to target-domain characteristics without requiring ground-truth annotations. V-HPOT significantly improves 3D hand pose estimation in cross-domain scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite requiring approximately 3.5× to 14× less data.
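The virtual-camera idea above can be illustrated with a minimal sketch. This is not the paper's exact formulation: the function names, the choice of normalising the focal length by image width, and the reference focal `f_ref` are all illustrative assumptions; the key property shown is that depths expressed in the shared virtual camera space can be mapped back to metric depths for any camera, making the prediction target intrinsics-agnostic.

```python
import numpy as np

# Illustrative sketch only (assumed transform, not the paper's exact one):
# express keypoint depth relative to a normalised focal length f/W, so all
# cameras share a single virtual camera with reference focal f_ref.
def to_virtual_depth(z, focal_px, img_width, f_ref=1.0):
    f_norm = focal_px / img_width      # focal length normalised by image size
    return z * f_ref / f_norm          # depth in the shared virtual camera space

def to_real_depth(z_virtual, focal_px, img_width, f_ref=1.0):
    f_norm = focal_px / img_width
    return z_virtual * f_norm / f_ref  # invert at inference for each target camera

# Round trip: the same metric depths are recovered regardless of intrinsics.
z = np.array([0.45, 0.60])             # hypothetical keypoint depths in metres
zv = to_virtual_depth(z, focal_px=600.0, img_width=1280)
z_back = to_real_depth(zv, focal_px=600.0, img_width=1280)
assert np.allclose(z_back, z)
```

Because the model only ever predicts `z_virtual`, the same network can be applied to cameras with different focal lengths and resolutions, which is also what makes the test-time consistency loss possible without ground truth.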