As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Existing approaches, however, predominantly align entire motion sequences with global textual representations. This global-centric paradigm overlooks the fine-grained interactions between local motion segments and individual body joints on the motion side and text tokens on the language side, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics, to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods and achieves precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.
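The pyramidal joint-to-segment-to-global alignment described above can be illustrated with a minimal sketch. Everything here is an illustrative assumption, not the paper's actual method: the function name, the input shapes, and the pooling choices (max over tokens per joint, mean within segments, mean over segments) are all hypothetical stand-ins for the learned alignment the framework describes.

```python
import numpy as np

def pyramidal_similarity(joint_feats, token_feats, segment_ids):
    """Hypothetical pyramidal (joint -> segment -> global) matching score.

    joint_feats : (J, D) per-joint motion embeddings (assumed shapes)
    token_feats : (T, D) per-token text embeddings
    segment_ids : (J,) integer temporal-segment assignment for each joint feature
    """
    # L2-normalise both sides so dot products are cosine similarities
    j = joint_feats / np.linalg.norm(joint_feats, axis=1, keepdims=True)
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)

    sim = j @ t.T                    # (J, T) joint-token similarity matrix
    joint_scores = sim.max(axis=1)   # joint level: best-matching token per joint

    # Segment level: pool joint scores within each temporal segment
    seg_scores = np.array([joint_scores[segment_ids == s].mean()
                           for s in np.unique(segment_ids)])

    # Global level: pool segment scores into a sequence-text score
    return seg_scores.mean()
```

The bottom-up pooling mirrors the perception pyramid in the abstract: local joint-token matches are aggregated into segment-level evidence, which is in turn aggregated into a holistic motion-text score usable for retrieval ranking.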