We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale, heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy, partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion and transfer readily to multiple downstream tasks. We implement the motion encoder as a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, as exemplified by the lowest 3D pose estimation error reported so far when trained from scratch. Furthermore, our framework achieves state-of-the-art performance on all three downstream tasks by finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/
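The pretraining objective described above, recovering 3D motion from noisy, partial 2D observations, can be illustrated with a minimal NumPy sketch of how such corrupted inputs might be built. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function names `corrupt_2d` and `reconstruction_loss`, the orthographic projection (dropping the depth axis), the Gaussian noise model, the masking probabilities, and the mean per-joint distance used as the loss.

```python
import numpy as np


def corrupt_2d(motion_3d, joint_mask_prob=0.15, frame_mask_prob=0.05,
               noise_std=0.02, rng=None):
    """Build a noisy, partial 2D observation from a 3D motion clip.

    motion_3d: array of shape (T, J, 3), T frames and J joints.
    Returns (pose_2d, visible): pose_2d has shape (T, J, 2) and
    visible is a boolean array of shape (T, J).
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    motion_3d = np.asarray(motion_3d, dtype=np.float64)
    num_frames, num_joints, _ = motion_3d.shape

    # Assumed orthographic projection: keep x and y, drop depth.
    pose_2d = motion_3d[..., :2].copy()

    # Assumed noise model: independent Gaussian jitter per coordinate.
    pose_2d += rng.normal(0.0, noise_std, size=pose_2d.shape)

    # Hide individual joints and whole frames at random.
    joint_visible = rng.random((num_frames, num_joints)) >= joint_mask_prob
    frame_visible = rng.random((num_frames, 1)) >= frame_mask_prob
    visible = joint_visible & frame_visible

    # Masked joints are zeroed so the encoder cannot read them.
    pose_2d[~visible] = 0.0
    return pose_2d, visible


def reconstruction_loss(pred_3d, target_3d):
    """Mean Euclidean distance per joint between prediction and target."""
    return float(np.mean(np.linalg.norm(pred_3d - target_3d, axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(16, 17, 3))   # synthetic stand-in for a 3D clip
    noisy_2d, visible = corrupt_2d(clip, rng=rng)
    print(noisy_2d.shape, visible.mean())
```

In a training loop under these assumptions, a motion encoder would take `noisy_2d` as input and its 3D output would be scored against the original clip with `reconstruction_loss`; the encoder itself (DSTformer) is not sketched here.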