Realistic text-to-motion generation has progressed rapidly in recent years. Yet existing methods often fail or produce implausible motions when given unseen text inputs, which limits their applications. In this paper, we present OMG, a novel framework that enables compelling motion generation from zero-shot open-vocabulary text prompts. Our key idea is to carefully tailor the pretrain-then-finetune paradigm to text-to-motion generation. At the pre-training stage, our model improves its generation ability by learning rich out-of-domain inherent motion traits. To this end, we scale a large unconditional diffusion model up to 1B parameters so as to exploit massive unlabeled motion data comprising over 20M motion instances. At the subsequent fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information through a trainable copy of the pre-trained model and the proposed Mixture-of-Controllers (MoC) block. The MoC block adaptively recognizes various ranges of sub-motions with a cross-attention mechanism and processes them separately with text-token-specific experts. This design effectively aligns the CLIP token embeddings of text prompts with various ranges of compact and expressive motion features. Extensive experiments demonstrate that OMG achieves significant improvements over state-of-the-art methods on zero-shot text-to-motion generation. Project page: https://tr3e.github.io/omg-page.
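To make the MoC idea concrete, the following is a minimal NumPy sketch of one way a cross-attention gate could route motion frames to text-token-specific experts. It is an illustration under stated assumptions, not the implementation used in OMG: the function names (`moc_block`, `init_params`), the parameter names, and the scale-and-shift form of each expert are hypothetical choices made here, and the abstract does not specify how the experts are parameterized.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, d_c, rng):
    """Random projection weights (hypothetical names and shapes)."""
    shapes = {"w_q": (d, d), "w_k": (d_c, d), "w_scale": (d_c, d), "w_shift": (d_c, d)}
    return {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}


def moc_block(motion, tokens, params):
    """Illustrative mixture-of-controllers step.

    motion: (T, d) motion features, one row per frame.
    tokens: (N, d_c) text token embeddings (e.g. from CLIP).
    Returns the updated motion features (T, d) and the gate (T, N).
    """
    # Cross-attention gate: each frame attends over the text tokens.
    q = motion @ params["w_q"]                                # (T, d)
    k = tokens @ params["w_k"]                                # (N, d)
    gate = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (T, N)

    # ASSUMPTION: each token defines its own expert as a scale-and-shift
    # of the frame features, with parameters derived from that token.
    scale = tokens @ params["w_scale"]                        # (N, d)
    shift = tokens @ params["w_shift"]                        # (N, d)
    expert_out = motion[:, None, :] * scale[None] + shift[None]  # (T, N, d)

    # Mix expert outputs per frame with the gate, then add residually.
    mixed = (gate[:, :, None] * expert_out).sum(axis=1)       # (T, d)
    return motion + mixed, gate


rng = np.random.default_rng(0)
motion = rng.normal(size=(16, 8))    # 16 frames, 8-dim features
tokens = rng.normal(size=(5, 12))    # 5 text tokens, 12-dim embeddings
params = init_params(d=8, d_c=12, rng=rng)
out, gate = moc_block(motion, tokens, params)
```

In this sketch each row of `gate` sums to one, so every frame distributes its update across the text tokens; frames that attend mostly to one token are processed mostly by that token's expert. Setting the expert weights to zero leaves the motion features unchanged, which loosely mirrors how a control branch can start as a no-op on top of a pre-trained model.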