Current techniques struggle to generate motions from intricate semantic descriptions, primarily because datasets lack sufficient semantic annotations and models have weak contextual understanding. To address both issues, we present SemanticBoost, a novel framework comprising a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary semantics from motion data, enriching the dataset's textual descriptions and ensuring precise alignment between text and motion data without depending on large language models. CAMD, in turn, generates high-quality, semantically consistent motion sequences by effectively capturing context information and aligning the generated motion with the given textual descriptions. Unlike existing methods, our approach can synthesize accurate orientational movements, combined motions based on descriptions of specific body parts, and motions from complex, extended sentences. Experimental results demonstrate that SemanticBoost, a diffusion-based method, outperforms auto-regressive techniques, achieving state-of-the-art performance on the HumanML3D dataset while maintaining realistic and smooth motion generation.
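To make the idea of extracting supplementary semantics from motion data concrete, the following is a minimal illustrative sketch, not the Semantic Enhancement module itself. It assumes a hypothetical input of per-frame root yaw angles in radians, with counter-clockwise rotation treated as a left turn, and appends a rule-based orientation phrase to an existing caption without any language model. The function names, the 30-degree threshold, and the 45-degree rounding are all assumptions made for illustration.

```python
import math


def describe_turn(yaw_radians, threshold_deg=30.0):
    """Summarize per-frame yaw angles as a turn phrase, or None if the turn is small.

    Assumption: angles are in radians and counter-clockwise is a left turn.
    """
    total = 0.0
    for prev, cur in zip(yaw_radians, yaw_radians[1:]):
        # Wrap each frame-to-frame difference into [-pi, pi) so that
        # crossing the +/-pi boundary is not counted as a full rotation.
        delta = (cur - prev + math.pi) % (2 * math.pi) - math.pi
        total += delta
    degrees = math.degrees(total)
    if abs(degrees) < threshold_deg:
        return None
    side = "left" if degrees > 0 else "right"
    # Round to the nearest multiple of 45 degrees for a readable phrase.
    rounded = int(round(abs(degrees) / 45.0)) * 45
    return f"turns {side} by about {rounded} degrees"


def enrich_caption(caption, yaw_radians):
    """Append the orientation phrase to a caption when a clear turn is detected."""
    extra = describe_turn(yaw_radians)
    if extra is None:
        return caption
    return f"{caption.rstrip('. ')}; the person {extra}."
```

For example, under these assumptions `enrich_caption("a person walks.", [0.0, math.pi / 4, math.pi / 2])` yields a caption that also states the person turns left by about 90 degrees, while a sequence with negligible net rotation leaves the caption unchanged.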