Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods adopt a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences and thus reducing retrieval accuracy. In addition, these global-embedding methods offer limited interpretability of retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
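To make the token-wise late interaction concrete, the following is a minimal sketch of a MaxSim-style score between text tokens and motion tokens, in the spirit of ColBERT-style late interaction: each text token is matched to its most similar motion token, and the per-token maxima are summed. The function name `maxsim_score`, the tensor shapes, and the assumption of L2-normalized embeddings are illustrative choices, not the paper's actual implementation.

```python
import torch


def maxsim_score(text_tokens: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
    """MaxSim late-interaction score (sketch).

    text_tokens:   (Lt, D) token embeddings from the text encoder
    motion_tokens: (Lm, D) token embeddings from the motion encoder
    Both are assumed L2-normalized, so dot products are cosine similarities.
    """
    # Pairwise token similarities: (Lt, Lm)
    sim = text_tokens @ motion_tokens.T
    # For each text token, keep its best-matching motion token, then sum.
    return sim.max(dim=1).values.sum()


# Hypothetical usage: rank a gallery of motions for one text query.
# text_emb: (Lt, D); motion_gallery: list of (Lm_i, D) tensors.
# scores = torch.stack([maxsim_score(text_emb, m) for m in motion_gallery])
# ranked = scores.argsort(descending=True)
```

Because the score decomposes into per-text-token matches, the argmax motion token for each word can be inspected directly, which is one way such late-interaction schemes expose fine-grained, interpretable correspondences.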