3D hand pose estimation has made significant progress in recent years, but this progress depends heavily on large-scale annotated datasets. To reduce this dependence on labels, we propose HaMuCo, a multi-view collaborative self-supervised learning framework that is trained using only pseudo labels. We use a two-stage strategy to tackle the noisy-label challenge and the multi-view ``groupthink'' problem. In the first stage, we estimate the 3D hand pose for each view independently. In the second stage, we employ a cross-view interaction network to capture cross-view correlated features and use a multi-view consistency loss to achieve collaborative learning among views. To further strengthen the collaboration between single-view and multi-view estimation, we fuse the results of all views to supervise the single-view network. In summary, we introduce collaborative learning at two levels: the cross-view level and the multi-view to single-view level. Extensive experiments show that our method achieves state-of-the-art performance on multi-view self-supervised hand pose estimation. Ablation studies verify the effectiveness of each component, and results on multiple datasets further demonstrate the generalization ability of our network.
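The two levels of collaboration can be illustrated with a small NumPy sketch. It is not the HaMuCo implementation: the abstract does not give the form of the consistency loss or the fusion rule, so the squared-distance consistency term, the confidence-weighted average, and the use of known camera-to-world extrinsics are all assumptions made for illustration. The function names (`to_world`, `fuse_views`, `consistency_loss`, `single_view_loss`) are hypothetical.

```python
import numpy as np


def to_world(poses_cam, rotations, translations):
    """Map per-view joints from camera frames into one shared world frame.

    poses_cam:    (V, J, 3) joints predicted independently per view.
    rotations:    (V, 3, 3) camera-to-world rotations (assumed known).
    translations: (V, 3)    camera-to-world translations (assumed known).
    """
    rotated = np.einsum("vij,vkj->vki", rotations, poses_cam)
    return rotated + translations[:, None, :]


def fuse_views(poses_world, confidences):
    """Assumed fusion rule: confidence-weighted average over views.

    poses_world: (V, J, 3); confidences: (V, J), non-negative.
    Returns a single fused pose of shape (J, 3).
    """
    total = np.clip(confidences.sum(axis=0, keepdims=True), 1e-8, None)
    weights = confidences / total
    return (weights[..., None] * poses_world).sum(axis=0)


def consistency_loss(poses_world):
    """Assumed cross-view term: mean squared distance to the cross-view mean."""
    mean_pose = poses_world.mean(axis=0, keepdims=True)
    return float(((poses_world - mean_pose) ** 2).sum(axis=-1).mean())


def single_view_loss(poses_world, fused):
    """Assumed multi-view to single-view term: mean per-joint L2 distance
    between every view's prediction and the fused pose, used as a fixed target."""
    return float(np.linalg.norm(poses_world - fused[None], axis=-1).mean())
```

In a training loop, the fused pose would typically be treated as a constant target (no gradient flowing through it), so that the single-view network is pulled toward the multi-view result rather than the reverse. That choice is likewise an assumption here.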