Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation

Accurate 3D hand pose estimation is a cornerstone of understanding human activity in egocentric vision. Most existing estimation methods still take single-view images as input, which brings limitations such as a limited field of view and depth ambiguity. Adding a second camera to better capture the shape of the hands is a practical way to address these problems. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) they require multi-view annotations for training, which are expensive to obtain; 2) at test time, the model becomes inapplicable if the camera parameters or layout differ from those used in training. In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation; 2) our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on two stereo constraints: pair-wise cross-view consensus and invariance of the transformation between the two views. These constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results show that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings, and outperforms existing adaptation methods. Project page: https://github.com/MickeyLLG/S2DHand.
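
To make the two stereo constraints more concrete, the following is a minimal illustrative sketch and not the S2DHand implementation. It assumes each view's estimator outputs root-relative 3D joints as a (21, 3) array, so the two views differ only by a rotation. The function names, the use of the Kabsch algorithm to recover that rotation, and the plain averaging used for cross-view consensus are all assumptions made for illustration.

```python
import numpy as np


def estimate_rotation(joints_a: np.ndarray, joints_b: np.ndarray) -> np.ndarray:
    """Kabsch algorithm: least-squares rotation R with joints_b ~= joints_a @ R.T.

    Both inputs are (N, 3) root-relative joints, so no translation is estimated.
    """
    # Cross-covariance between the two joint sets, shape (3, 3).
    h = joints_a.T @ joints_b
    u, _, vt = np.linalg.svd(h)
    # Force det(R) = +1 so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def consensus_pseudo_labels(joints_a, joints_b, rotation):
    """Hypothetical cross-view consensus: average both predictions in view A's frame."""
    b_in_a = joints_b @ rotation          # map view-B joints into view A
    pseudo_a = 0.5 * (joints_a + b_in_a)  # simple consensus by averaging
    pseudo_b = pseudo_a @ rotation.T      # map the consensus back into view B
    return pseudo_a, pseudo_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth_a = rng.normal(size=(21, 3))
    angle = np.deg2rad(40.0)
    r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle), np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    # Simulated noisy single-view predictions in each camera frame.
    pred_a = truth_a + 0.05 * rng.normal(size=truth_a.shape)
    pred_b = truth_a @ r_true.T + 0.05 * rng.normal(size=truth_a.shape)
    r_est = estimate_rotation(pred_a, pred_b)
    pseudo_a, _ = consensus_pseudo_labels(pred_a, pred_b, r_est)
    print("single-view error:", np.linalg.norm(pred_a - truth_a, axis=1).mean())
    print("consensus error:  ", np.linalg.norm(pseudo_a - truth_a, axis=1).mean())
```

Because the camera layout is fixed within a dual-view pair, a rotation estimated this way could in principle be accumulated over many frames rather than recomputed per frame; that design choice is likewise an assumption of this sketch, not a detail stated above.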
