Recent advances in 3D hand pose estimation have shown promising results, but their effectiveness has relied primarily on large-scale annotated datasets, which are laborious and costly to create. To alleviate this label-hungry limitation, we propose HaMuCo, a self-supervised learning framework that learns a single-view hand pose estimator from multi-view pseudo 2D labels. However, one of the main challenges of self-supervised learning is the presence of noisy labels and the "groupthink" effect from multiple views. To overcome these issues, we introduce a cross-view interaction network that distills the single-view estimator by utilizing cross-view correlated features and enforcing multi-view consistency to achieve collaborative learning. The single-view estimator and the cross-view interaction network are trained jointly in an end-to-end manner. Extensive experiments show that our method achieves state-of-the-art performance on multi-view self-supervised hand pose estimation. Furthermore, the proposed cross-view interaction network can also be applied to hand pose estimation from multi-view input, where it outperforms previous methods under the same settings.