Thanks to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite the many application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims to retrieve relevant motions given a natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
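To make the retrieval task concrete, the following is a minimal sketch of how a text query could be matched against a collection of motions once both modalities are embedded in a shared space. It is an illustration under stated assumptions, not the implementation from the repository: the toy vectors stand in for the outputs of the text and motion encoders, and cosine similarity and recall@k are assumed here as a common similarity measure and a common retrieval metric, neither of which is specified in the abstract. The function names `cosine_similarity`, `rank_motions`, and `recall_at_k` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_motions(text_emb: Sequence[float],
                 motion_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return motion indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(text_emb, m) for m in motion_embs]
    return sorted(range(len(motion_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(rankings: Sequence[Sequence[int]],
                relevant: Sequence[int], k: int) -> float:
    """Fraction of queries whose relevant motion appears in the top-k results."""
    hits = sum(1 for ranking, target in zip(rankings, relevant)
               if target in ranking[:k])
    return hits / len(rankings)


# Toy embeddings (assumption): placeholders for encoder outputs.
motion_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 1]  # index of the ground-truth motion for each text query

rankings = [rank_motions(t, motion_embs) for t in text_embs]
print(rankings)
print(recall_at_k(rankings, relevant, k=1))
```

In practice the embeddings would come from trained encoders, and the ranking would typically be computed with batched matrix operations rather than Python loops.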