Text-conditioned human motion generation has made great progress with the introduction of diffusion models and corresponding control signals. However, interactions between humans remain underexplored. To model interactions among an arbitrary number of humans, we define interactions as pairs of human joints that are either in contact or separated, and leverage a {\em Large Language Model (LLM) Planner} to translate interaction descriptions into contact plans. Given these contact plans, interaction generation can be achieved with spatially controllable motion generation methods that take joint contacts as spatial conditions. We present a novel approach named InterControl for flexible spatial control of every joint of every person at any time, leveraging a motion diffusion model trained only on single-person data. We incorporate a motion ControlNet to generate coherent and realistic motions from sparse spatial control signals, and a loss guidance module that precisely aligns any joint to the desired position in a classifier-guidance manner via Inverse Kinematics (IK). Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate its effectiveness in versatile joint control. We also collect joint contact pair data with LLMs to demonstrate InterControl's ability to generate human interactions.
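To make the notion of a contact plan concrete, the following is a minimal sketch, not InterControl's implementation. It assumes a plan is a list of joint pairs with a frame window and a distance threshold, and it applies a hinge loss with plain gradient steps directly on global joint positions. The actual method applies guidance inside the diffusion denoising process via IK; that part is omitted here. All names (`JointPair`, `plan_loss_and_grad`, `guide`), the joint index 20, the 22-joint skeleton, and the step size are hypothetical choices for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class JointPair:
    """One entry of a hypothetical contact plan."""
    person_a: int
    joint_a: int
    person_b: int
    joint_b: int
    start: int      # first frame of the window (inclusive)
    end: int        # last frame of the window (exclusive)
    contact: bool   # True: joints should touch; False: joints should stay apart
    dist: float     # distance threshold in metres


def plan_loss_and_grad(pos, plan):
    """Hinge loss over a contact plan and its gradient w.r.t. joint positions.

    pos has shape (people, frames, joints, 3) and holds global positions.
    """
    loss = 0.0
    grad = np.zeros_like(pos)
    for p in plan:
        a = pos[p.person_a, p.start:p.end, p.joint_a]   # (T, 3)
        b = pos[p.person_b, p.start:p.end, p.joint_b]   # (T, 3)
        diff = a - b
        d = np.linalg.norm(diff, axis=-1)               # (T,)
        if p.contact:
            viol = np.maximum(d - p.dist, 0.0)          # too far apart
            sign = 1.0
        else:
            viol = np.maximum(p.dist - d, 0.0)          # too close
            sign = -1.0
        loss += float(viol.sum())
        unit = diff / np.maximum(d, 1e-8)[:, None]      # direction from b to a
        g = sign * (viol > 0)[:, None] * unit           # zero where satisfied
        grad[p.person_a, p.start:p.end, p.joint_a] += g
        grad[p.person_b, p.start:p.end, p.joint_b] -= g
    return loss, grad


def guide(pos, plan, step=0.05, iters=200):
    """Move joints by gradient descent until the plan is satisfied."""
    pos = pos.copy()
    for _ in range(iters):
        loss, grad = plan_loss_and_grad(pos, plan)
        if loss == 0.0:
            break
        pos -= step * grad
    return pos


# Two people standing 1 m apart along x, 4 frames, 22 joints (assumed skeleton).
pos = np.zeros((2, 4, 22, 3))
pos[1, :, :, 0] = 1.0
# Ask joint 20 of both people to be within 5 cm during frames 1 and 2.
plan = [JointPair(0, 20, 1, 20, start=1, end=3, contact=True, dist=0.05)]
guided = guide(pos, plan)
```

In this toy setting only the constrained joints in the constrained frames move, which mirrors the sparse nature of the control signal. Producing a coherent full-body motion around those joints is the role the abstract assigns to the motion ControlNet.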