In this paper, we introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation. LGTM utilizes a diffusion-based architecture and aims to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation. Traditional methods often struggle with semantic discrepancies, particularly in aligning specific motions to the correct body parts. To address this issue, we propose a two-stage pipeline: it first employs large language models (LLMs) to decompose global motion descriptions into part-specific narratives, which are then processed by independent body-part motion encoders to ensure precise local semantic alignment. Finally, an attention-based full-body optimizer refines the generated motion and ensures overall coherence. Our experiments demonstrate that LGTM achieves significant improvements in generating locally accurate, semantically aligned human motion, marking a notable advancement in text-to-motion applications. Code and data for this paper are available at https://github.com/L-Sun/LGTM