Recent methods for estimating 3D human pose from multi-view images have achieved impressive performance on public datasets collected in relatively easy scenarios. However, few approaches extract 3D human skeletons from multimodal inputs (e.g., RGB and point cloud), even though such inputs can improve the accuracy of 3D pose prediction in challenging situations. We fill this gap by introducing PointVoxel, a pipeline that fuses multi-view RGB and point cloud inputs to obtain 3D human poses. We demonstrate that a volumetric representation is an effective way to integrate these different modalities. Moreover, to overcome the difficulty of annotating 3D human pose labels in challenging scenarios, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy, so that a well-trained 3D human pose estimator can be obtained without any manual annotations. We evaluate our approach on four datasets (two public datasets, one synthetic dataset, and BasketBall, a challenging dataset that we collected ourselves), showing promising results. The code and dataset will be released soon.