MCM: Multi-condition Motion Synthesis Framework for Multi-scenario

The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, such as text, music, and speech. This allows the task to cover multiple scenarios, such as text-to-motion and music-to-dance. While existing research has primarily focused on single conditions, multi-condition human motion generation remains underexplored. In this paper, we address this gap by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework can be integrated with any DDPM-like diffusion model to accommodate multi-conditional information input while preserving the model's generative capabilities. Specifically, MCM employs a two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the parameters of the main branch, which maintains the generation ability of the main branch while supporting multi-condition input. We also introduce MWNet, a Transformer-based, DDPM-like diffusion model, as our main branch; it captures the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves SoTA results in text-to-motion and competitive results in music-to-dance, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methods originally designed for text-to-motion to domains such as music-to-dance and speech-to-gesture, eliminating the need for extensive network reconfiguration, but also enables effective multi-condition modal control, realizing "once trained is motion need".
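
The two-branch initialization described above can be illustrated with a small sketch. It is not the authors' implementation: `TinyBranch`, `TwoBranchModel`, and `gate` are hypothetical names, the "network" is a single linear layer in pure Python, and two details are assumptions that the abstract does not specify, namely that the extra condition is added to the control branch's input and that the control output is mixed into the main output through a zero-initialized scale.

```python
import copy
import random


class TinyBranch:
    """Hypothetical stand-in for a denoising network: one linear layer y = W x + b."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


class TwoBranchModel:
    """Main branch plus a control branch copied from it."""

    def __init__(self, main):
        self.main = main
        # Same structure and same initial parameters as the main branch,
        # but stored separately so the two can diverge during training.
        self.control = copy.deepcopy(main)
        # Assumption (not stated in the abstract): a mixing scale that starts at zero.
        self.gate = 0.0

    def __call__(self, x, condition):
        base = self.main(x)
        # Assumption: the extra condition is injected by adding it to the input.
        ctrl = self.control([a + c for a, c in zip(x, condition)])
        return [b + self.gate * c for b, c in zip(base, ctrl)]


main = TinyBranch(dim=4)
model = TwoBranchModel(main)
x = [0.1, -0.2, 0.3, 0.4]
cond = [1.0, 0.0, -1.0, 0.5]
```

Under these assumptions, `model(x, cond)` equals `main(x)` while `gate` is zero, so adding the control branch leaves the main branch's output unchanged until the scale is updated; this is one possible reading of how the generation ability of the main branch is maintained.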
