We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets are terse and carry minimal information. By carefully crafting prompts for LLMs, we generate richer, more fine-grained descriptions of the action. We show that using these detailed descriptions in place of the original action phrases leads to better alignment of the text and motion spaces. The approach is generic and compatible with both stochastic (e.g., VAE-based) and deterministic (e.g., MotionCLIP) text-to-motion models. It also allows multiple text descriptions to be used. Our experiments show (i) noticeable qualitative and quantitative improvements in the quality of synthesized motions, (ii) the benefits of using multiple LLM-generated descriptions, (iii) the suitability of the prompt function, and (iv) the zero-shot generation capabilities of the proposed approach. Project page: https://actiongpt.github.io
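The pipeline the abstract describes (wrap an action phrase in a prompt, collect several LLM descriptions, and feed them to a text-to-motion model's text encoder) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names, the `llm` and `text_encoder` callables, and the use of a plain element-wise mean to combine the descriptions are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence

# Hypothetical prompt template; the wording is illustrative, not quoted from the paper.
PROMPT_TEMPLATE = (
    "Describe in detail the body movements of a person performing the action: {action}"
)


def build_prompt(action_phrase: str) -> str:
    """Wrap a terse action phrase in a prompt that asks for body-movement detail."""
    return PROMPT_TEMPLATE.format(action=action_phrase.strip())


def generate_descriptions(
    action_phrase: str, llm: Callable[[str], str], k: int = 4
) -> List[str]:
    """Query the LLM k times with the same prompt and collect the k descriptions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    prompt = build_prompt(action_phrase)
    return [llm(prompt) for _ in range(k)]


def aggregate_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors (an assumed aggregation scheme)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def embed_action(
    action_phrase: str,
    llm: Callable[[str], str],
    text_encoder: Callable[[str], Sequence[float]],
    k: int = 4,
) -> List[float]:
    """Produce one text embedding for an action from k LLM-generated descriptions."""
    descriptions = generate_descriptions(action_phrase, llm, k)
    return aggregate_embeddings([text_encoder(d) for d in descriptions])
```

Here `llm` stands for any callable that maps a prompt string to a generated description, and `text_encoder` for the text branch of whichever text-to-motion model is being used; the resulting vector would take the place of the embedding of the original action phrase. Passing both in as callables keeps the sketch independent of any particular LLM API or motion model, in the plug-and-play spirit the abstract describes.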