Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omitted actions, disordered sequencing, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event: the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each event with a motion-aware retrieval model, and integrates the resulting embeddings through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies confirm the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve event order and achieve naturalness close to the ground truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts.
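The abstract describes the conditioning mechanism only at a high level. Below is a minimal sketch of what event-based cross-attention inside a Conformer-style block could look like, assuming standard PyTorch; the module name, dimensions, and interface are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class EventCrossAttention(nn.Module):
    """Hypothetical sketch: motion-frame features (queries) attend over a
    sequence of per-event text embeddings (keys/values), one embedding per
    decomposed event, instead of a single pooled prompt embedding."""

    def __init__(self, motion_dim: int, event_dim: int, num_heads: int = 8):
        super().__init__()
        # kdim/vdim let the event embeddings live in a different space
        # (e.g., a retrieval-model text encoder) than the motion features.
        self.attn = nn.MultiheadAttention(
            embed_dim=motion_dim, kdim=event_dim, vdim=event_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.norm = nn.LayerNorm(motion_dim)

    def forward(self, motion_feats, event_embs, event_mask=None):
        # motion_feats: (B, T, motion_dim) noisy motion features from the
        #   diffusion backbone; event_embs: (B, E, event_dim), one per event;
        #   event_mask: (B, E), True where an event slot is padding.
        attended, _ = self.attn(
            query=motion_feats, key=event_embs, value=event_embs,
            key_padding_mask=event_mask,
        )
        # Residual connection + LayerNorm, as in a standard Conformer sub-block.
        return self.norm(motion_feats + attended)


if __name__ == "__main__":
    B, T, E = 2, 196, 3  # batch, motion frames, events per prompt (assumed)
    block = EventCrossAttention(motion_dim=512, event_dim=768)
    motion = torch.randn(B, T, 512)
    events = torch.randn(B, E, 768)
    print(block(motion, events).shape)  # torch.Size([2, 196, 512])
```

Keeping one embedding per event, rather than one pooled prompt embedding, lets each motion frame attend to the event it should currently realize, which is the property the abstract attributes to event-level conditioning.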