Automatic markerless estimation of infant posture and motion from ordinary videos holds great potential for movement studies "in the wild": it can facilitate the understanding of motor development and greatly increase the chances of early diagnosis of disorders. Human pose estimation methods in computer vision are developing rapidly thanks to advances in deep learning and machine learning. However, these methods are trained on datasets that feature adults in a variety of contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in the supine position. Surprisingly, all methods except DeepLabCut and MediaPipe perform competitively without additional finetuning, with ViTPose performing best. In addition to standard performance metrics (object keypoint similarity, average precision, and average recall), we introduce errors expressed relative to the neck-mid-hip distance, and we study missed and redundant detections as well as the reliability of each method's internal confidence ratings, all of which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, together with our analysis scripts and processed data, at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
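To make the two kinds of metric named in the abstract concrete, the following is a minimal sketch and not the authors' analysis code. It assumes the COCO-style definition of object keypoint similarity and assumes that the neck-mid-hip error is the per-keypoint Euclidean error divided by the ground-truth distance from the neck to the midpoint of the hips; the exact definitions used in the paper may differ. The function names, the argument layout, and the keypoint indices are hypothetical.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """COCO-style object keypoint similarity for a single person.

    pred, gt : lists of (x, y) keypoints in pixels
    visible  : per-keypoint flags; only labelled keypoints are scored
    area     : object area in pixels^2 (the squared scale s^2)
    sigmas   : per-keypoint falloff constants (k_i = 2 * sigma_i)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if not v:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


def neck_mid_hip_errors(pred, gt, neck_idx, left_hip_idx, right_hip_idx):
    """Per-keypoint error divided by the neck-to-mid-hip distance.

    Assumption: the normalizer is taken from the ground-truth pose, so the
    error is expressed as a fraction of the infant's torso length.
    """
    lhx, lhy = gt[left_hip_idx]
    rhx, rhy = gt[right_hip_idx]
    mid_hip = ((lhx + rhx) / 2.0, (lhy + rhy) / 2.0)
    torso = math.dist(gt[neck_idx], mid_hip)
    if torso == 0.0:
        raise ValueError("neck and mid-hip coincide; cannot normalize")
    return [math.dist(p, g) / torso for p, g in zip(pred, gt)]


# Toy three-keypoint skeleton: neck, left hip, right hip (hypothetical order).
gt_pose = [(0.0, 0.0), (-1.0, 2.0), (1.0, 2.0)]
pred_pose = [(0.0, 1.0), (-1.0, 2.0), (1.0, 2.0)]

score = oks(pred_pose, gt_pose, [1, 1, 1], area=100.0, sigmas=[0.1, 0.1, 0.1])
errors = neck_mid_hip_errors(pred_pose, gt_pose, 0, 1, 2)
```

In this toy case the neck prediction is off by one pixel and the torso is two pixels long, so the neck error is half a torso length. Normalizing by a body-relative length rather than by pixels keeps errors comparable across infants of different sizes and across videos recorded at different distances and resolutions.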