Cross-modal retrieval between images and text, and between videos and text, is a prominent research area in computer vision and natural language processing. However, cross-modal retrieval between human motion and text has received insufficient attention despite its wide-ranging applicability. To address this gap, we employ a concise yet effective dual-unimodal transformer encoder for this task. Recognizing that overlapping atomic actions in different human motion sequences can lead to semantic conflicts between samples, we explore a novel triplet loss function called DropTriple Loss. This loss discards false negative samples from the negative sample set and focuses on mining the remaining genuinely hard negative samples for triplet training, thereby reducing the constraint violations that false negatives would otherwise cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both based on R@10). The source code for our approach is publicly available at https://github.com/eanson023/rehamot.
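To make the idea behind DropTriple Loss concrete, the following is a minimal PyTorch sketch of a DropTriple-style objective, not the authors' released implementation. It assumes a square motion-text similarity matrix whose diagonal holds the positive pairs, and it uses an assumed gap-based criterion (`fn_gap`) to flag negatives that score nearly as high as the positive as false negatives before hard-negative mining.

```python
import torch


def droptriple_loss(sim, margin=0.2, fn_gap=0.05):
    """Sketch of a DropTriple-style triplet loss (assumed form).

    sim:    (N, N) motion-text similarity matrix; sim[i, i] scores the
            i-th positive pair.
    margin: triplet margin.
    fn_gap: assumed criterion -- a negative scoring within fn_gap of the
            positive is treated as a false negative and dropped.
    """
    n = sim.size(0)
    pos_m = sim.diag().view(n, 1)   # positive score per motion anchor (rows)
    pos_t = pos_m.t()               # positive score per text anchor (columns)

    eye = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Hinge costs for both retrieval directions, zeroed on the diagonal.
    cost_m = (margin + sim - pos_m).clamp(min=0).masked_fill(eye, 0)
    cost_t = (margin + sim - pos_t).clamp(min=0).masked_fill(eye, 0)

    # Drop suspected false negatives: samples almost as similar to the anchor
    # as the true positive (e.g. sequences sharing overlapping atomic actions).
    cost_m = cost_m.masked_fill((sim > pos_m - fn_gap) & ~eye, 0)
    cost_t = cost_t.masked_fill((sim > pos_t - fn_gap) & ~eye, 0)

    # Mine the hardest remaining negative per anchor in each direction.
    return cost_m.max(dim=1).values.mean() + cost_t.max(dim=0).values.mean()
```

The key difference from a standard hard-negative triplet loss is the masking step: negatives suspected of sharing semantics with the anchor are removed before the per-anchor maximum is taken, so only genuinely hard negatives drive the gradient.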