Several recent methods estimate 3D human pose from multi-view images and achieve impressive performance on public datasets collected in relatively easy scenarios. However, few approaches extract 3D human skeletons from multimodal inputs (e.g., RGB and point cloud), even though such inputs can improve the accuracy of 3D pose prediction in challenging situations. We fill this gap by introducing a pipeline called PointVoxel that fuses multi-view RGB and point cloud inputs to obtain 3D human poses. We demonstrate that a volumetric representation is an effective architecture for integrating these different modalities. Moreover, to overcome the difficulty of annotating 3D human poses in challenging scenarios, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy, so that a well-trained 3D human pose estimator can be obtained without any manual annotations. We evaluate our approach on four datasets (two public datasets, one synthetic dataset, and one challenging dataset named BasketBall that we collected ourselves), showing promising results. The code and dataset will be released soon.