Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. Existing approaches primarily generate holistic (full-body) motions and struggle to accurately reflect actions that involve specific body parts. Recent part-wise motion generation methods attempt to address this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often produce incoherent full-body motions because they integrate independently generated part motions. To overcome these issues and resolve the resulting trade-off between part expressiveness and full-body coherence, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) a Part-Guided Network, which first generates part motions to obtain part guidance and then uses that guidance to generate holistic motions; (2) Part-aware Text Grounding, which transforms text embeddings in diverse ways and aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
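The abstract does not specify how Holistic-Part Fusion combines the two motion streams. As one possible reading of "adaptively fuses", the sketch below uses a per-frame sigmoid gate that blends a holistic feature sequence with a part feature sequence. The function `gated_fusion`, its `weights` and `bias` parameters, and the choice of a scalar gate are all assumptions made for illustration, not the method used in ParTY.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(holistic, part, weights, bias):
    """Blend two per-frame feature sequences with a scalar gate per frame.

    Hypothetical illustration only; not the fusion rule from the paper.

    holistic, part: lists of frames, each frame a list of D floats.
    weights: list of 2 * D floats (assumed gate parameters).
    bias: float (assumed gate bias).
    Returns a list of fused frames with the same shape as the inputs.
    """
    fused = []
    for h_frame, p_frame in zip(holistic, part):
        # The gate depends on both inputs, so the blend adapts per frame.
        z = bias + sum(w * x for w, x in zip(weights, h_frame + p_frame))
        g = sigmoid(z)
        # g close to 1 favors the holistic stream, g close to 0 the part stream.
        fused.append([g * h + (1.0 - g) * p for h, p in zip(h_frame, p_frame)])
    return fused


if __name__ == "__main__":
    holistic = [[1.0, 3.0]]  # one frame, two feature dimensions
    part = [[3.0, 5.0]]
    # Zero weights and zero bias give g = 0.5, i.e. a plain average.
    print(gated_fusion(holistic, part, [0.0] * 4, 0.0))
```

In a trained model the gate parameters would be learned, and the gate could just as well be computed per body part or per feature dimension; the scalar per-frame gate here is simply the smallest form that shows the idea.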