RSB-Pose: Robust Short-Baseline Binocular 3D Human Pose Estimation with Occlusion Handling

3D human pose estimation has widespread everyday applications, and the demand for convenient acquisition equipment continues to grow. To meet this demand, we focus on a short-baseline binocular setting, which offers both portability and a geometric measurement property that radically mitigates depth ambiguity. However, as the binocular baseline shortens, two serious challenges emerge: first, the robustness of 3D reconstruction against 2D errors deteriorates; second, occlusion becomes a problem again because the two views differ only slightly. To address the first challenge, we propose the Stereo Co-Keypoints Estimation module, which improves the view consistency of 2D keypoints and thereby the robustness of the 3D reconstruction. In this module, disparity represents the correspondence between binocular 2D points, and a Stereo Volume Feature (SVF) is introduced to hold binocular features across different disparities. By regressing from the SVF, the 2D keypoints of both views are estimated jointly, which enforces their view consistency. To deal with occlusions, we further introduce a Pre-trained Pose Transformer module, which refines 3D poses by perceiving pose coherence, a representation of joint correlations. This perception is provided by the Pose Transformer network and learned through a pre-training task that recovers iteratively masked joints. Comprehensive experiments on the H36M and MHAD datasets, complemented by visualizations, validate the effectiveness of our approach for short-baseline binocular 3D human pose estimation and occlusion handling.
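
The first challenge, reduced robustness to 2D errors at short baselines, can be illustrated with the standard depth-from-disparity relation for a rectified pinhole stereo pair, Z = f * b / d. The sketch below is only an illustration under that assumption, not the paper's reconstruction procedure; the function names and all numeric values (focal length, depth, baselines, pixel error) are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point in a rectified stereo pair: Z = f * b / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, true_depth_m, disparity_error_px):
    """Absolute depth error caused by a given error in the measured disparity."""
    true_disparity = focal_px * baseline_m / true_depth_m
    noisy_depth = depth_from_disparity(
        focal_px, baseline_m, true_disparity + disparity_error_px
    )
    return abs(noisy_depth - true_depth_m)


# Hypothetical setup: 1000 px focal length, subject at 3 m, 1 px keypoint error.
for baseline_m in (1.0, 0.5, 0.1):
    err = depth_error(1000.0, baseline_m, 3.0, 1.0)
    print(f"baseline {baseline_m:.1f} m -> depth error {err * 100:.2f} cm")
```

To first order the depth error is Z^2 * Δd / (f * b), so under these assumptions the same one-pixel inconsistency between the two views costs roughly ten times more depth accuracy at a tenth of the baseline, which is why view-consistent 2D keypoints matter more as the baseline shortens.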
