Text-driven human motion generation is an important yet challenging problem in computer vision. However, current methods are limited to producing either deterministic or imprecise motion sequences, and they fail to effectively control the temporal and spatial relationships needed to conform to a given text description. In this work, we propose a fine-grained method for generating high-quality human motion sequences conditioned on precise text descriptions. Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize the text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference. Experiments show that our approach outperforms existing text-driven motion generation methods on the HumanML3D and KIT test sets and generates motions that are visually more consistent with the text conditions.
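To make the distinction between neighborhood and overall semantic features concrete, the following is a minimal illustrative sketch and not the method's actual implementation. It assumes a hand-written, hypothetical dependency graph for the phrase "person walks forward slowly", one-hot word features, and plain mean aggregation in place of learned graph neural network layers. One propagation step stands in for the shallow network, and three steps stand in for the deep one.

```python
# Toy sketch only: mean-aggregation message passing on a word graph.
# The graph, the features, and the function name are illustrative assumptions.

def propagate(features, edges, steps):
    """Run `steps` rounds of mean aggregation over undirected edges with self-loops."""
    n = len(features)
    neighbors = [{i} for i in range(n)]  # every node also listens to itself
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    current = [list(f) for f in features]
    for _ in range(steps):
        updated = []
        for i in range(n):
            group = [current[j] for j in sorted(neighbors[i])]
            # Average each feature dimension over the node's neighborhood.
            updated.append([sum(col) / len(group) for col in zip(*group)])
        current = updated
    return current


words = ["person", "walks", "forward", "slowly"]
# One-hot features, so each dimension tracks how much of one word has arrived.
features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Hypothetical dependency edges: the verb "walks" is linked to every other word.
edges = [(1, 0), (1, 2), (1, 3)]

shallow = propagate(features, edges, steps=1)  # neighborhood (1-hop) context
deep = propagate(features, edges, steps=3)     # context from the whole sentence

print("shallow 'person':", shallow[0])
print("deep 'person':   ", deep[0])
```

In this toy setting, the shallow representation of "person" mixes only with its direct neighbor "walks", while the deep representation also carries information from "forward" and "slowly". This mirrors, in a very simplified form, the idea of combining local and sentence-level linguistic context across multiple inference steps.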