Human pose estimation is a key task in computer vision, with applications such as activity recognition and interactive systems. However, annotated skeletons are inconsistent across datasets, which makes it difficult to develop universally applicable models. To address this challenge, we propose a novel approach that integrates multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, which contain 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 more than the original COCO annotations and 5 more than the original MPII annotations, which improves cross-dataset generalization. Our joint models achieve average accuracies of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. We further evaluate all 21 keypoints predicted by our two models on the Halpe dataset, obtaining APs of 66.84 and 72.75. These results highlight the potential of our technique to address one of the most pressing challenges in pose estimation research and application: the inconsistency in skeletal annotations.
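The keypoint counts above can be illustrated with a minimal sketch. The abstract does not list the 21 keypoints, so the layout below is an assumption: it takes the union of the standard COCO and MPII joint sets (with MPII names normalized to COCO-style naming), which happens to give 21 joints. The names `UNIFIED` and `to_unified` are hypothetical helpers, not the authors' code.

```python
# Joint names from the COCO (17) and MPII (16) annotation formats.
# MPII names are normalized to COCO-style "left_/right_" naming for matching.
COCO = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
MPII = [
    "right_ankle", "right_knee", "right_hip", "left_hip", "left_knee",
    "left_ankle", "pelvis", "thorax", "upper_neck", "head_top",
    "right_wrist", "right_elbow", "right_shoulder", "left_shoulder",
    "left_elbow", "left_wrist",
]

# Assumption: the unified skeleton is COCO order followed by MPII-only joints.
UNIFIED = COCO + [name for name in MPII if name not in COCO]


def to_unified(keypoints, names):
    """Scatter (x, y) annotations into the unified layout.

    Returns the unified coordinates and a 0/1 mask marking which joints
    the source dataset actually annotates, so that a training loss could
    skip the unannotated ones.
    """
    index = {name: i for i, name in enumerate(UNIFIED)}
    coords = [(0.0, 0.0)] * len(UNIFIED)
    mask = [0] * len(UNIFIED)
    for name, xy in zip(names, keypoints):
        coords[index[name]] = xy
        mask[index[name]] = 1
    return coords, mask


# Usage: an MPII sample with dummy coordinates fills 16 of the 21 slots.
mpii_sample = [(float(i), float(i)) for i in range(len(MPII))]
coords, mask = to_unified(mpii_sample, MPII)
print(len(UNIFIED), sum(mask))
```

Under this assumed layout, each dataset supervises only its own subset of the unified skeleton; the remaining joints would have to be supervised some other way, for example by a teacher trained on the other dataset.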