Pose-estimation methods enable extracting human motion from ordinary videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data remains a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search a database for the motions most relevant to a specified natural-language description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously, together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion-Language and HumanML3D datasets. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods.
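To make the idea of a cross-modal contrastive objective with uni-modal regularization concrete, the following is a minimal NumPy sketch. The exact CCCL formulation is not given in the abstract, so this is an illustrative assumption: a standard symmetric InfoNCE term over the text-motion similarity matrix, plus a hypothetical uni-modal consistency term that aligns the intra-modal similarity structure of the two embedding spaces. The function names (`cccl`, `contrastive_term`) and the weighting scheme are invented for illustration only.

```python
import numpy as np

def _log_softmax(x):
    """Row-wise log-softmax, numerically stabilized."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def contrastive_term(sim, tau=0.07):
    """Symmetric InfoNCE: matched text-motion pairs lie on the diagonal."""
    n = sim.shape[0]
    idx = np.arange(n)
    logits = sim / tau
    loss_t2m = -_log_softmax(logits)[idx, idx].mean()   # text -> motion
    loss_m2t = -_log_softmax(logits.T)[idx, idx].mean()  # motion -> text
    return 0.5 * (loss_t2m + loss_m2t)

def cccl(text_emb, motion_emb, lam=0.5, tau=0.07):
    """Illustrative cross-consistent contrastive loss (assumed form).

    Cross-modal InfoNCE plus a uni-modal term that penalizes divergence
    between the text-text and motion-motion Gram (similarity) matrices.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    cross = contrastive_term(t @ m.T, tau)
    # Hypothetical uni-modal consistency: the two modalities should induce
    # similar neighborhood structure within their own embedding spaces.
    uni = np.mean((t @ t.T - m @ m.T) ** 2)
    return cross + lam * uni
```

As a sanity check, perfectly aligned text and motion embeddings yield a lower loss than mismatched random ones, since the diagonal of the cross-modal similarity matrix dominates and the uni-modal penalty vanishes.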