Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, improving the quality and realism of generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, which rely primarily on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model- and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on the HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
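To make the contrast with fixed-length semantic injection concrete, below is a minimal PyTorch sketch of the general idea described above: instead of conditioning every motion token on a single pooled sentence vector, each motion token cross-attends over the full sequence of per-word text embeddings, so different segments of a composite prompt can steer different parts of the motion. All names here (`CompositeAwareInjectionSketch`, `motion_dim`, `text_dim`, etc.) are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CompositeAwareInjectionSketch(nn.Module):
    """Hypothetical sketch of composite-aware semantic injection.

    Motion tokens (queries) attend over per-word text embeddings
    (keys/values) rather than one fixed-length pooled vector, letting
    each motion token bind to the relevant part of a composite prompt.
    """

    def __init__(self, motion_dim: int = 256, text_dim: int = 512, n_heads: int = 4):
        super().__init__()
        # Project per-word text features to the motion token width.
        self.text_proj = nn.Linear(text_dim, motion_dim)
        # Cross-attention acting as the text-motion aligner.
        self.aligner = nn.MultiheadAttention(motion_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(motion_dim)

    def forward(self, motion_tokens, text_tokens, text_padding_mask=None):
        # motion_tokens: (B, T_motion, motion_dim) -- queries
        # text_tokens:   (B, T_text, text_dim)     -- per-word features
        txt = self.text_proj(text_tokens)
        attended, attn_weights = self.aligner(
            query=motion_tokens, key=txt, value=txt,
            key_padding_mask=text_padding_mask,
        )
        # Residual injection; attn_weights expose the learned dynamic
        # text-motion correspondence for inspection.
        return self.norm(motion_tokens + attended), attn_weights


if __name__ == "__main__":
    B, T_motion, T_text = 2, 196, 20
    inject = CompositeAwareInjectionSketch()
    motion = torch.randn(B, T_motion, 256)
    text = torch.randn(B, T_text, 512)
    out, weights = inject(motion, text)
    print(out.shape, weights.shape)
    # torch.Size([2, 196, 256]) torch.Size([2, 196, 20])
```

Because the injection operates per motion token, a sketch like this can sit inside either an autoregressive decoder layer or a diffusion denoiser block, which is consistent with the model- and representation-agnostic claim; the returned attention weights also give a natural handle for the qualitative alignment analyses mentioned above.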