We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy, partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, and they transfer readily to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, as evidenced by the lowest 3D pose estimation error reported so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations.
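To make the pretraining objective concrete, the following is a minimal sketch of how "noisy partial 2D observations" could be derived from a 3D motion sequence and how a reconstruction error could be measured. It is not the authors' implementation: the orthographic projection, the Gaussian noise model, the per-joint masking scheme, and the names `corrupt_2d`, `reconstruction_loss`, `mask_prob`, and `noise_std` are all assumptions introduced here for illustration.

```python
import numpy as np


def corrupt_2d(motion_3d, mask_prob=0.15, noise_std=0.02, rng=None):
    """Build a noisy, partial 2D observation from a 3D motion clip.

    motion_3d: array of shape (T, J, 3), i.e. frames x joints x (x, y, z).
    Returns (pose_2d, visible), where pose_2d has shape (T, J, 2) and
    visible is a boolean mask of shape (T, J).
    All corruption choices below are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: orthographic projection, i.e. simply drop the depth axis.
    pose_2d = motion_3d[..., :2].copy()
    # Assumption: additive Gaussian noise stands in for 2D detector error.
    pose_2d += rng.normal(0.0, noise_std, size=pose_2d.shape)
    # Assumption: each joint in each frame is hidden independently.
    visible = rng.random(pose_2d.shape[:2]) >= mask_prob
    pose_2d[~visible] = 0.0  # hidden joints are zeroed out
    return pose_2d, visible


def reconstruction_loss(pred_3d, target_3d):
    """Mean per-joint Euclidean distance between predicted and true 3D motion."""
    return float(np.linalg.norm(pred_3d - target_3d, axis=-1).mean())
```

In this sketch a motion encoder would receive `pose_2d` as input and be trained so that its 3D output minimizes `reconstruction_loss` against the original `motion_3d`; the encoder itself is deliberately left out.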