We address the challenge of estimating 3D human poses from multiple views under occlusion and with limited view overlap. We treat multi-view, single-person 3D human pose reconstruction as a regression problem and propose a novel encoder-decoder Transformer architecture that estimates 3D poses from multi-view 2D pose sequences. The encoder refines the 2D skeleton joints detected across views and time steps, fusing multi-view and temporal information through global self-attention. We enhance the encoder with a geometry-biased attention mechanism that leverages the geometric relationships between views. Additionally, we use the detection scores provided by the 2D pose detector to guide the encoder's attention according to the reliability of each 2D detection. The decoder then regresses the 3D pose sequence from the refined tokens, using a pre-defined query for each joint. To improve generalization to unseen scenes and resilience to missing joints, we apply scene centering, synthetic views, and token dropout. We conduct extensive experiments on three public benchmark datasets: Human3.6M, CMU Panoptic, and Occlusion-Persons. Our results demonstrate the efficacy of our approach, particularly in occluded scenes and when few views are available, both of which are traditionally challenging for triangulation-based methods.
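The abstract does not give the exact form of the attention biases, so the following is only a minimal sketch of one way detection scores and a view-geometry term could steer attention, together with token dropout. It assumes both biases are additive on the attention logits and that the score bias is the logarithm of the key token's detection score; the names `biased_attention_weights`, `geometry_bias`, and `token_dropout` are hypothetical and not taken from the paper.

```python
import math
import random


def biased_attention_weights(logits, confidence, geometry_bias=None, eps=1e-6):
    """Row-wise softmax over attention logits with additive biases.

    logits:        N x N list of raw query-key scores.
    confidence:    length-N list of detector scores in [0, 1], one per key token.
    geometry_bias: optional N x N list added to the logits (hypothetical
                   stand-in for a term derived from camera geometry).
    """
    n = len(logits)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            # Assumption: unreliable keys are down-weighted via log(confidence).
            score = logits[i][j] + math.log(confidence[j] + eps)
            if geometry_bias is not None:
                score += geometry_bias[i][j]
            row.append(score)
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def token_dropout(confidence, drop_prob, rng):
    """Simulate missing detections by zeroing the score of random tokens."""
    return [0.0 if rng.random() < drop_prob else c for c in confidence]


# Example: four tokens with uniform logits; dropped tokens get near-zero weight.
rng = random.Random(0)
scores = token_dropout([0.9, 0.8, 0.7, 0.95], drop_prob=0.25, rng=rng)
uniform_logits = [[0.0] * 4 for _ in range(4)]
weights = biased_attention_weights(uniform_logits, scores)
```

Under these assumptions, a token whose detection score is zero (whether from a failed detection or from dropout during training) receives almost no attention from any query, so the remaining views and frames carry the estimate.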