We introduce Unimotion, the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding. While existing works control avatar motion with global text conditioning or with fine-grained per-frame scripts, none can do both at once. In addition, no existing work can output frame-level text paired with the generated poses. In contrast, Unimotion allows users to control motion with global text, local frame-level text, or both at once, providing more flexible control. Importantly, Unimotion is the first model that, by design, outputs local text paired with the generated poses, letting users know what motion happens and when, which is necessary for a wide range of applications. We show that Unimotion opens up new applications: (1) hierarchical control, allowing users to specify motion at different levels of detail; (2) obtaining motion text descriptions for existing MoCap data or YouTube videos; and (3) editability, generating motion from text and then editing the motion via text edits. Moreover, Unimotion attains state-of-the-art results on the frame-level text-to-motion task on the established HumanML3D dataset. The pre-trained model and code are available on our project page at https://coral79.github.io/uni-motion/.