Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
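To make the idea of composing a set of diffusion models in latent space concrete, the following is a minimal sketch of one common way to combine per-concept noise predictions, the conjunction rule known from composable diffusion work: eps = eps_uncond + sum_i w_i (eps_i - eps_uncond). It is an illustration under stated assumptions, not the EnergyMoGen implementation. The names `compose_conjunction`, `toy_predictor`, and the `(latent, timestep, condition)` predictor signature are hypothetical, and the semantic-aware energy model and Synergistic Energy Fusion described above are not modeled here.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature (an assumption, not from the paper): a noise
# predictor maps (latent, timestep, condition) to a predicted noise vector.
NoisePredictor = Callable[[Vector, int, object], Vector]


def compose_conjunction(
    predict_noise: NoisePredictor,
    latent: Vector,
    timestep: int,
    conditions: Sequence[object],
    weights: Sequence[float],
    null_condition: object = None,
) -> Vector:
    """Combine per-concept noise predictions into one composed prediction.

    Computes eps = eps_uncond + sum_i w_i * (eps_i - eps_uncond), where each
    eps_i is the prediction conditioned on a single concept.
    """
    if len(conditions) != len(weights):
        raise ValueError("need exactly one weight per condition")

    # Unconditional prediction serves as the shared baseline.
    eps_uncond = predict_noise(latent, timestep, null_condition)
    composed = list(eps_uncond)

    # Add each concept's weighted offset from the baseline.
    for cond, w in zip(conditions, weights):
        eps_cond = predict_noise(latent, timestep, cond)
        for k in range(len(composed)):
            composed[k] += w * (eps_cond[k] - eps_uncond[k])
    return composed


def toy_predictor(latent: Vector, timestep: int, condition: object) -> Vector:
    """Stand-in for a trained denoiser, used only to exercise the function.

    The unconditional call returns the latent unchanged; a numeric condition
    shifts every coordinate by that amount.
    """
    shift = 0.0 if condition is None else float(condition)
    return [x + shift for x in latent]


if __name__ == "__main__":
    eps = compose_conjunction(toy_predictor, [0.0, 1.0], 10, [1.0, 2.0], [0.5, 0.5])
    print(eps)
```

In a real sampler, the composed prediction would replace the single conditional prediction at each denoising step. The weights play a role similar to a guidance scale per concept, and how they are chosen is left open in this sketch.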