Human keypoint detection for close proximity human-robot interaction

We study the performance of state-of-the-art human keypoint detectors in the context of close proximity human-robot interaction. Detection in this scenario is distinctive in that only a subset of body parts, such as the hands and torso, is in the field of view. In particular, (i) we survey existing datasets with human pose annotation from the perspective of close proximity images, and we prepare and make publicly available a new Human in Close Proximity (HiCP) dataset; (ii) we quantitatively and qualitatively compare state-of-the-art human whole-body 2D keypoint detection methods (OpenPose, MMPose, AlphaPose, Detectron2) on this dataset; (iii) since accurate detection of hands and fingers is critical in applications with handovers, we evaluate the performance of the MediaPipe hand detector; (iv) we deploy the algorithms on a humanoid robot with an RGB-D camera on its head and evaluate their performance in 3D human keypoint detection, using a motion capture system as the reference. The best performing whole-body keypoint detectors in close proximity were MMPose and AlphaPose, but both had difficulty with finger detection. We therefore propose combining MMPose or AlphaPose for the body with MediaPipe for the hands in a single framework, which provides the most accurate and robust detection. We also analyse the failure modes of individual detectors, for example, to what extent the absence of the person's head in the image degrades performance. Finally, we demonstrate the framework in a scenario where a humanoid robot interacting with a person uses the detected 3D keypoints for whole-body avoidance maneuvers.
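
To illustrate how 2D keypoints from an RGB-D camera can be lifted to 3D, as in step (iv), the following is a minimal sketch. It assumes a pinhole camera model with known intrinsics and a depth image aligned to the colour image. The function name, parameters, and example intrinsics are hypothetical and are not taken from the authors' implementation.

```python
from typing import Optional, Tuple


def keypoint_to_3d(
    u: float,
    v: float,
    depth_m: Optional[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Tuple[float, float, float]]:
    """Back-project pixel (u, v) with depth in metres to camera-frame XYZ.

    Assumes a pinhole model: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. Returns None when depth is missing
    or non-positive, which is common at object edges in depth images.
    """
    if depth_m is None or depth_m <= 0.0:
        return None  # no valid depth reading for this keypoint
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 image.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A keypoint at the principal point, 1 m away, lies on the optical axis.
wrist = keypoint_to_3d(320.0, 240.0, 1.0, FX, FY, CX, CY)
```

Keypoints for which the function returns None would need separate handling, for instance by skipping them or sampling depth from neighbouring pixels; which strategy suits a given setup is left open here.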
