Current techniques struggle to generate motions from intricate semantic descriptions, primarily because datasets lack sufficient semantic annotations and models have weak contextual understanding. To address these issues, we present SemanticBoost, a novel framework that tackles both challenges simultaneously. Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary semantics from motion data, enriching the dataset's textual descriptions and ensuring precise alignment between text and motion data without depending on large language models. CAMD, in turn, generates high-quality, semantically consistent motion sequences by effectively capturing context information and aligning the generated motion with the given textual descriptions. Unlike existing methods, our approach can synthesize accurate orientational movements, combined motions based on descriptions of specific body parts, and motions from complex, extended sentences. Our experimental results demonstrate that SemanticBoost, a diffusion-based method, outperforms autoregressive techniques, achieving state-of-the-art performance on the HumanML3D dataset while maintaining realistic and smooth motion generation.