SMooGPT: Stylized Motion Generation using Large Language Models

Stylized motion generation is an actively studied problem in computer graphics and has benefited in particular from rapid advances in diffusion models. The goal is to produce a novel motion that respects both the motion content and the desired motion style, e.g., "walking in a loop like a monkey". Existing work addresses this problem through motion style transfer or conditional motion generation, typically embedding the motion style in a latent space and guiding the motion implicitly in that latent space as well. Despite this progress, these methods suffer from low interpretability and control, generalize poorly to new styles, and fail to produce motions other than "walking" because of the strong bias in the public stylization dataset. In this paper, we approach stylized motion generation from a new perspective of reasoning-composition-generation, based on three observations: i) human motion can often be described effectively in natural language in a body-part-centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion is inherently compositional, so new motion content or styles can be generated by effective recomposition. We therefore propose using the body-part text space as an intermediate representation and present SMooGPT, a fine-tuned LLM that acts as a reasoner, composer, and generator when producing the desired stylized motion. Our method operates in the body-part text space, which offers much higher interpretability, enables fine-grained motion control, effectively resolves potential conflicts between motion content and style, and generalizes well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, together with a user perceptual study, demonstrate the effectiveness of our approach, especially for purely text-driven stylized motion generation.
