Versatile Motion Language Models for Multi-Turn Interactive Agents

Recent advances in large language models (LLMs) have greatly improved their ability to generate natural, contextually relevant text, making AI interactions more human-like. However, generating and understanding interactive human motion, in which two individuals engage in coordinated movements, remains challenging because such coordination is complex to model. Furthermore, a versatile model is needed to handle diverse interactive scenarios, such as chat systems that follow user instructions or adapt to an assigned role while adjusting interaction dynamics. To tackle this problem, we introduce VIM, short for the Versatile Interactive Motion language model, which integrates the language and motion modalities to understand, generate, and control interactive motions in multi-turn conversational contexts. To address the scarcity of multi-turn interactive motion data, we introduce a synthetic dataset, INTER-MT2, for which we use pre-trained models to create diverse instruction data with interactive motion. Our approach first trains a motion tokenizer that encodes interactive motions into residual discrete tokens. In the pretraining stage, the model learns to align motion and text representations through these discrete tokens. During the instruction fine-tuning stage, VIM adapts to multi-turn conversations using the INTER-MT2 dataset. We evaluate the versatility of our method across motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. The results highlight the versatility and effectiveness of the proposed method in handling complex interactive motion synthesis.
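
To make the phrase "residual discrete tokens" concrete, the following is a minimal sketch of residual quantization in plain Python. It is not the tokenizer described here: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the toy `CODEBOOKS` values are hypothetical, and a real motion tokenizer would presumably learn its codebooks jointly with a neural encoder and decoder. The sketch only shows the core idea, that each layer quantizes what the previous layers left unexplained.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next layer sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-layer codebooks over 2-D features (illustrative values only).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse layer
    [[0.0, 0.0], [0.25, -0.25]],   # refinement layer
]

tokens = rvq_encode([1.2, 0.8], CODEBOOKS)
reconstruction = rvq_decode(tokens, CODEBOOKS)
```

In this toy setup a feature vector becomes one token index per layer, and adding layers can only shrink the reconstruction error on the encoded vector when each codebook contains a zero entry, as both do here. Sequences of such indices are the kind of discrete input a language model could be trained on alongside text tokens.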
