Human Pose Estimation (HPE) involves detecting and localizing keypoints on the human body from visual data. In 3D HPE, occlusions, where parts of the body are not visible in the image, pose a significant challenge for accurate pose reconstruction. This paper presents a benchmark of the robustness of 3D HPE models under realistic occlusion conditions, covering combinations of occluded keypoints commonly observed in real-world scenarios. We evaluate nine state-of-the-art 2D-to-3D HPE models, spanning convolutional, transformer-based, graph-based, and diffusion-based architectures, using the BlendMimic3D dataset, a synthetic dataset with ground-truth 2D/3D annotations and occlusion labels. All models were originally trained on Human3.6M and are tested here without retraining to assess their generalization. We introduce a protocol that simulates occlusion by adding noise to 2D keypoints based on real detector behavior, and conduct both global and per-joint sensitivity analyses. Our findings reveal that all models exhibit notable performance degradation under occlusion, with diffusion-based models underperforming despite their stochastic nature. Additionally, a per-joint occlusion analysis identifies consistent vulnerability in distal joints (e.g., wrists, feet) across models. Overall, this work highlights critical limitations of current 3D HPE models in handling occlusions and provides insights for improving real-world robustness.
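The occlusion-simulation protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the array layout `(frames, joints, 2)`, the Gaussian noise model, and the `noise_std` value are all assumptions standing in for the detector-derived noise statistics the paper measures.

```python
import numpy as np

def simulate_occlusion(keypoints_2d, occluded_joints, noise_std=15.0, seed=0):
    """Perturb the 2D keypoints of occluded joints with Gaussian noise,
    mimicking the jitter a 2D detector exhibits on invisible body parts.

    keypoints_2d    : (frames, joints, 2) array of pixel coordinates.
    occluded_joints : indices of the joints treated as occluded.
    noise_std       : noise scale in pixels (hypothetical value; the paper
                      derives this from real detector behavior).
    """
    rng = np.random.default_rng(seed)
    noisy = keypoints_2d.copy()
    idx = list(occluded_joints)
    # Only the occluded joints are perturbed; visible joints keep their
    # ground-truth 2D coordinates.
    noisy[:, idx, :] += rng.normal(0.0, noise_std, size=noisy[:, idx, :].shape)
    return noisy
```

The perturbed keypoints are then fed to each frozen 2D-to-3D model, and the resulting 3D error is compared against the clean-input baseline, either globally or per joint.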