To build a cross-modal latent space between 3D human motion and language, acquiring large-scale, high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To address this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. Motion patches are created by dividing the skeleton joints of a motion sequence into body parts and sorting them within each part; they are robust to varying skeleton structures and can be treated like the color image patches that ViT consumes. We find that transfer learning with ViT weights pre-trained on 2D image data boosts the performance of motion analysis, presenting a promising direction for mitigating the scarcity of motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance on text-to-motion retrieval benchmarks, as well as on novel, challenging tasks such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, all of which are currently impeded by the lack of data.
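To make the motion-patch idea concrete, below is a minimal sketch of how a motion sequence could be rearranged into a ViT-compatible "image" and split into patches. The body-part grouping, joint indices, and patch size here are illustrative assumptions, not the paper's exact construction: rows are joints ordered by body part, columns are time frames, and the xyz coordinates play the role of RGB channels.

```python
import numpy as np

# Hypothetical body-part grouping (joint indices are skeleton-specific
# and chosen here only for illustration).
BODY_PARTS = {
    "torso":     [0, 3, 6, 9, 12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

def motion_to_patches(motion, patch_size=16):
    """Turn a motion sequence (T frames, J joints, 3 coords) into
    ViT-style patches: rows = joints sorted by body part, columns =
    time, channels = xyz treated like RGB."""
    T, J, C = motion.shape
    order = [j for part in BODY_PARTS.values() for j in part]
    image = motion[:, order, :].transpose(1, 0, 2)  # (J, T, 3) "image"
    # Zero-pad so both spatial dims are divisible by the patch size.
    H = -(-image.shape[0] // patch_size) * patch_size
    W = -(-image.shape[1] // patch_size) * patch_size
    canvas = np.zeros((H, W, C), dtype=image.dtype)
    canvas[: image.shape[0], : image.shape[1]] = image
    # Split into non-overlapping patch_size x patch_size patches,
    # exactly as ViT patchifies an RGB image.
    patches = (
        canvas.reshape(H // patch_size, patch_size,
                       W // patch_size, patch_size, C)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch_size, patch_size, C)
    )
    return patches  # fed to a ViT after linear projection

# Example: a 120-frame clip on a 22-joint skeleton.
patches = motion_to_patches(np.random.randn(120, 22, 3).astype(np.float32))
print(patches.shape)  # (16, 16, 16, 3): 16 patches of 16x16x3
```

Because the patch layout depends only on the body-part grouping rather than on one fixed joint count, remapping a different skeleton onto the same parts yields compatible patches, which is what makes the representation robust across skeleton structures.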