Cross-modal retrieval between images and text and between videos and text is a prominent research area in computer vision and natural language processing. Cross-modal retrieval between human motion and text, however, has received little attention despite its wide applicability. To address this gap, we use a simple yet effective dual-unimodal transformer encoder for this task. Because different human motion sequences can share overlapping atomic actions, a sample treated as a negative may in fact be semantically close to the anchor, which causes semantic conflicts between samples. We therefore explore a novel triplet loss function, DropTriple Loss, which discards false negative samples from the negative sample set and mines the remaining genuinely hard negatives for triplet training, thereby reducing the violations that false negatives cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both measured by R@10). The source code for our approach is publicly available at https://github.com/eanson023/rehamot.
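The idea of discarding false negatives before mining hard negatives can be illustrated with a minimal sketch. This is not the released DropTriple Loss implementation: the function name `drop_triple_loss_sketch`, the parameters `margin` and `drop_threshold`, and in particular the criterion for identifying a false negative (a fixed similarity threshold) are assumptions made for illustration, since the abstract does not specify how false negatives are detected.

```python
from typing import List


def drop_triple_loss_sketch(
    sim: List[List[float]],
    margin: float = 0.2,          # hypothetical hinge margin
    drop_threshold: float = 0.9,  # hypothetical false-negative cutoff
) -> float:
    """Hinge triplet loss over a batch similarity matrix.

    sim[i][j] is the similarity between motion i and text j, and
    sim[i][i] is the matched (positive) pair. Negatives whose similarity
    reaches drop_threshold are assumed to be false negatives and are
    dropped; the hardest remaining negative is used in the hinge term.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Two retrieval directions: motion i against all other texts,
        # and text i against all other motions.
        directions = (
            [sim[i][j] for j in range(n) if j != i],
            [sim[j][i] for j in range(n) if j != i],
        )
        for negatives in directions:
            # Assumed dropping rule: discard suspiciously similar negatives.
            kept = [s for s in negatives if s < drop_threshold]
            if not kept:
                continue  # every negative was dropped for this anchor
            hardest = max(kept)  # hardest remaining negative
            total += max(0.0, margin + hardest - pos)
    return total / n


# Toy batch: text 1 is very similar to motion 0, as could happen when
# two motions share an atomic action.
toy_sim = [
    [0.80, 0.95, 0.10],
    [0.20, 0.70, 0.30],
    [0.10, 0.40, 0.90],
]
loss_with_drop = drop_triple_loss_sketch(toy_sim)
loss_without_drop = drop_triple_loss_sketch(toy_sim, drop_threshold=float("inf"))
```

In this toy batch, the highly similar off-diagonal pair dominates the loss when nothing is dropped, whereas with dropping it is excluded and the remaining negatives already satisfy the margin. Setting `drop_threshold` to infinity recovers an ordinary hardest-negative triplet loss, which makes the effect of the dropping step easy to compare.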