Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, they typically produce isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. To this end, we introduce Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels even when trained with only a single description per motion sequence, thanks to its Pose-Centric Cross-ATtention, which makes it robust to varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics, Peak Jerk and Area Under the Jerk, which detect abrupt transitions.
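As a rough illustration of how jerk-based metrics can flag abrupt transitions, the sketch below estimates jerk (the third time derivative of joint position) with finite differences and reduces it to a peak value and an area under the curve. The function names, the finite-difference scheme, the per-frame maximum over joints, and the `baseline` parameter are assumptions made for illustration; they are not taken from the FlowMDM implementation, whose exact definitions may differ.

```python
import math


def jerk_magnitudes(positions, fps):
    """Per-frame, per-joint jerk magnitude via third-order finite differences.

    positions: list of T frames; each frame is a list of J (x, y, z) tuples.
    fps: frame rate of the motion, in frames per second.
    Returns a list of T - 3 frames, each a list of J jerk magnitudes.
    """
    if len(positions) < 4:
        raise ValueError("at least 4 frames are needed to estimate jerk")
    scale = fps ** 3  # equals 1 / dt^3
    result = []
    for t in range(len(positions) - 3):
        frame = []
        for j in range(len(positions[t])):
            p0, p1, p2, p3 = (positions[t + k][j] for k in range(4))
            # Third forward difference per coordinate: p3 - 3*p2 + 3*p1 - p0
            diff = [
                (c3 - 3 * c2 + 3 * c1 - c0) * scale
                for c0, c1, c2, c3 in zip(p0, p1, p2, p3)
            ]
            frame.append(math.sqrt(sum(c * c for c in diff)))
        result.append(frame)
    return result


def peak_jerk(positions, fps):
    """Largest jerk magnitude over all frames and joints (illustrative)."""
    return max(max(frame) for frame in jerk_magnitudes(positions, fps))


def area_under_jerk(positions, fps, baseline=0.0):
    """Area between the per-frame maximum jerk and a reference level.

    baseline is a hypothetical reference jerk (e.g. a dataset average);
    the default of 0.0 integrates the raw jerk curve.
    """
    dt = 1.0 / fps
    per_frame = [max(frame) for frame in jerk_magnitudes(positions, fps)]
    return sum(abs(value - baseline) for value in per_frame) * dt


if __name__ == "__main__":
    fps = 30
    # Toy single-joint trajectory with constant acceleration (near-zero jerk).
    smooth = [[((t / fps) ** 2, 0.0, 0.0)] for t in range(60)]
    # Same trajectory with a sudden 0.5 offset at frame 30 (abrupt transition).
    abrupt = [
        [(frame[0][0] + (0.5 if t >= 30 else 0.0), 0.0, 0.0)]
        for t, frame in enumerate(smooth)
    ]
    print("smooth:", peak_jerk(smooth, fps), area_under_jerk(smooth, fps))
    print("abrupt:", peak_jerk(abrupt, fps), area_under_jerk(abrupt, fps))
```

In this toy setup, the discontinuous trajectory yields a much larger peak and area than the constant-acceleration one, which is the kind of signal a transition-smoothness metric is meant to capture.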