Its numerous applications make multi-human 3D pose estimation a highly impactful area of research. Nevertheless, in a multi-view system composed of several regular RGB cameras, multi-person 3D pose estimation presents several challenges. First, each person must be uniquely identified across the different views so that the 2D information provided by the cameras can be separated per person. Second, the estimation of each person's 3D pose from their multi-view 2D information must be robust to noise and potential occlusions in the scene. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks that predicts the cross-view correspondence of the people in the scene, along with a Multilayer Perceptron that takes the 2D points and yields the 3D pose of each person. Both models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.
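To illustrate how a 3D pose predictor can be trained without 3D annotations, the following minimal sketch computes a multi-view reprojection error between predicted 3D joints and the observed 2D keypoints. The abstract does not state which training objective is used, so the reprojection loss itself is an assumption here, as are the names `project` and `reprojection_loss`, the 3x4 projection-matrix camera model, and the visibility mask.

```python
import numpy as np


def project(points_3d, P):
    """Project (J, 3) world points with a 3x4 camera matrix P into (J, 2) image points."""
    ones = np.ones((points_3d.shape[0], 1))
    homogeneous = np.hstack([points_3d, ones])   # (J, 4)
    projected = homogeneous @ P.T                # (J, 3)
    return projected[:, :2] / projected[:, 2:3]  # perspective divide


def reprojection_loss(pred_3d, keypoints_2d, cameras, visible):
    """Mean distance between projected 3D joints and the visible 2D detections.

    pred_3d:      (J, 3) predicted joints of one person (hypothetical model output)
    keypoints_2d: (V, J, 2) detected joints of that person in each of V views
    cameras:      (V, 3, 4) projection matrices (assumed known from calibration)
    visible:      (V, J) boolean mask; occluded or missing joints are ignored
    """
    total, count = 0.0, 0
    for P, kp, vis in zip(cameras, keypoints_2d, visible):
        err = np.linalg.norm(project(pred_3d, P) - kp, axis=1)
        total += err[vis].sum()
        count += int(vis.sum())
    return total / max(count, 1)
```

Under these assumptions, only calibrated cameras and 2D detections are needed to score a predicted pose, and masking out non-visible joints is one simple way to keep occlusions from dominating the error.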