Recently, a few self-supervised representation learning (SSL) methods have outperformed ImageNet classification pre-training on vision tasks such as object detection. However, the effect of SSL on 3D human body pose and shape estimation (3DHPSE) remains an open question: the target of 3DHPSE is fixed to a single class, the human, and the task has an inherent gap with SSL. We empirically study and analyze the effects of SSL and compare it with other pre-training alternatives for 3DHPSE. The alternatives are 2D annotation-based pre-training and synthetic data pre-training, which share SSL's motivation of reducing labeling cost. Both have been widely used as sources of weak supervision or for fine-tuning, but have not been recognized as pre-training sources. We find that SSL methods underperform conventional ImageNet classification pre-training on multiple 3DHPSE benchmarks by 7.7% on average. In contrast, despite using much less pre-training data, 2D annotation-based pre-training improves accuracy on all benchmarks and converges faster during fine-tuning. Our observations challenge the naive application of current SSL pre-training to 3DHPSE and renew attention to the value of other data types for pre-training.