Comparison of marker-less 2D image-based methods for infant pose estimation

There are increasing efforts to automate clinical methods for the early diagnosis of developmental disorders, among them the General Movement Assessment (GMA), a video-based tool for classifying infant motor functioning. Accurate pose estimation is a crucial part of automated GMA. In this study we compare the performance of available generic and infant-specific pose estimators, and we compare two viewing angles for recording: the conventional diagonal view used in GMA and a top-down view. We used 4500 annotated video frames from 75 recordings of spontaneous motor functions of infants aged 4 to 26 weeks. To determine which pose estimation method and camera angle yield the best accuracy on infants in a GMA-related setting, we computed and compared the distance to human annotations and the percentage of correct key-points (PCK). The results show that ViTPose, the best-performing generic model trained on adults, also performs best on infants. On our infant dataset, specialized infant-pose estimators brought no improvement over generic pose estimators. However, retraining a generic model on our data significantly improved pose estimation accuracy. Pose estimation accuracy from the top-down view is significantly better than from the diagonal view, especially for the hip key-points. The results also indicate that infant-pose estimators generalize only to a limited extent to other infant datasets, so care is needed when choosing an infant-pose estimator and applying it to infant data it was not trained on. Although the standard GMA method uses a diagonal view for assessment, the accuracy gain from the top-down view suggests that a top-down view should be included in recording setups for automated GMA research.
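
The two evaluation measures named above, distance to human annotations and PCK, can be illustrated with a minimal sketch. This is not the study's evaluation code: the function names `keypoint_errors` and `pck`, the per-frame normalisation length `ref_length`, and the threshold factor `alpha = 0.2` are assumptions made for illustration, since the abstract does not state how the PCK threshold was defined.

```python
import numpy as np


def keypoint_errors(pred, gt):
    """Euclidean distance between predicted and annotated key-points.

    pred, gt: arrays of shape (n_frames, n_keypoints, 2) in pixel coordinates.
    Returns an array of shape (n_frames, n_keypoints).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=-1)


def pck(pred, gt, ref_length, alpha=0.2):
    """Fraction of key-points whose error is within alpha * ref_length.

    ref_length: one normalisation length per frame (hypothetical choice,
    e.g. a torso length in pixels); alpha is an assumed threshold factor.
    """
    errors = keypoint_errors(pred, gt)
    # Broadcast the per-frame threshold across all key-points of that frame.
    thresholds = alpha * np.asarray(ref_length, dtype=float)[:, None]
    return float(np.mean(errors <= thresholds))


# Toy example: one frame, three key-points, reference length of 10 pixels.
gt = [[[0, 0], [10, 0], [0, 10]]]
pred = [[[1, 0], [10, 5], [0, 10]]]
errors = keypoint_errors(pred, gt)   # distances of 1, 5 and 0 pixels
score = pck(pred, gt, ref_length=[10])  # threshold is 2 pixels
```

In this toy case two of the three key-points fall within the 2-pixel threshold, so the PCK is 2/3. Computing the score per key-point (for example, for the hips only) would only require averaging over the frame axis instead of over all entries.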
