Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field driven by deep learning and motion capture data, existing methods remain limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale, high-quality motion data. At the same time, foundation models trained on internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability across various downstream tasks. Leveraging these foundation models may benefit human motion synthesis, which some recent works have preliminarily explored. However, these methods do not fully unveil the foundation models' potential for this task and support only a few simple actions and environments. In this paper, we explore, for the first time and without any motion data, open-set human motion synthesis based on multimodal large language models (MLLMs), using natural language instructions as user control signals, across arbitrary motion tasks and environments. Our framework consists of two stages: 1) sequential keyframe generation, in which MLLMs act as a keyframe designer and animator; and 2) motion in-filling between keyframes through interpolation and motion tracking. Our method achieves general human motion synthesis for many downstream tasks. The promising results demonstrate the value of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.