Thanks to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
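To make the retrieval setting concrete, the following is a minimal sketch of how text-to-motion retrieval could be scored once both modalities have been embedded. It is not taken from the released code. It assumes that text and motion encoders map into a shared vector space, that relevance is measured by cosine similarity, and that each text query has exactly one ground-truth motion at the same index. The helper names `cosine`, `rank_motions`, and `recall_at_k`, the toy embeddings, and the Recall@k measure itself are hypothetical illustrations, not the evaluation protocol defined in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_motions(text_emb, motion_embs):
    """Return motion indices sorted from most to least similar to the query."""
    scores = [cosine(text_emb, m) for m in motion_embs]
    return sorted(range(len(motion_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_embs, motion_embs, k):
    """Fraction of queries whose ground-truth motion appears in the top k.

    Assumption: text_embs[i] describes motion_embs[i].
    """
    hits = sum(
        1
        for i, text_emb in enumerate(text_embs)
        if i in rank_motions(text_emb, motion_embs)[:k]
    )
    return hits / len(text_embs)


# Toy 2-D embeddings standing in for encoder outputs (made up for illustration).
motion_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]

print(rank_motions(text_embs[0], motion_embs))  # [0, 2, 1]
print(recall_at_k(text_embs, motion_embs, k=2))  # 1.0
```

In this toy example the third query is closest to the first motion, so its ground-truth motion is only found at rank 2. Recall@1 is therefore 2/3, while Recall@2 is 1.0.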