Diffusion models have recently made tremendous progress in generating realistic human motions, yet existing methods largely disregard multi-human interactions. In this paper, we present InterGen, an effective diffusion-based approach that incorporates human-to-human interactions into the motion diffusion process, enabling lay users to customize high-quality two-person interaction motions with only text guidance. We first contribute a multimodal dataset named InterHuman. It consists of about 107M frames of diverse two-person interactions, with accurate skeletal motions and 16,756 natural language descriptions. On the algorithm side, we carefully tailor the motion diffusion model to our two-person interaction setting. To handle the symmetry of human identities during interactions, we propose two cooperative transformer-based denoisers that explicitly share weights, with a mutual attention mechanism that further connects the two denoising processes. We then propose a novel representation for the motion input of our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame. We further introduce two novel regularization terms that encode spatial relations, together with a corresponding damping scheme applied during the training of our interaction diffusion model. Extensive experiments validate the effectiveness and generalizability of InterGen. Notably, it generates more diverse and compelling two-person motions than previous methods and enables various downstream applications for human interactions.
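The weight-sharing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SharedInteractionBlock` and `denoise_pair`, the single-head attention, and the random weights are all assumptions made for illustration, and the timestep and text conditioning, normalization, and feed-forward layers of a real denoiser are omitted. The sketch only shows why applying one set of weights to both performers, with each attending to the other, makes the result independent of which performer is labeled first.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add(xs, ys):
    """Element-wise sum of two sequences of vectors (residual connection)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def attend(queries, context, Wq, Wk, Wv):
    """Single-head scaled dot-product attention from `queries` onto `context`."""
    d = len(Wq)
    Q = [matvec(Wq, x) for x in queries]
    K = [matvec(Wk, x) for x in context]
    V = [matvec(Wv, x) for x in context]
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out


class SharedInteractionBlock:
    """Hypothetical block: ONE set of weights used for both performers."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def mat():
            return [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]

        self.self_w = (mat(), mat(), mat())    # self-attention projections
        self.mutual_w = (mat(), mat(), mat())  # mutual-attention projections

    def self_attend(self, x):
        # Each performer first attends over their own motion frames.
        return add(x, attend(x, x, *self.self_w))

    def mutual_attend(self, h, other_h):
        # Then each performer attends over the other performer's features.
        return add(h, attend(h, other_h, *self.mutual_w))


def denoise_pair(block, x_a, x_b):
    """Apply the same block to both performers, linked by mutual attention."""
    h_a = block.self_attend(x_a)
    h_b = block.self_attend(x_b)
    return block.mutual_attend(h_a, h_b), block.mutual_attend(h_b, h_a)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy "noisy motions": 5 frames, 4 features per frame, for each performer.
    x_a = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    x_b = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    block = SharedInteractionBlock(dim=4)
    out_a, out_b = denoise_pair(block, x_a, x_b)
    print(len(out_a), len(out_a[0]))
```

Because both branches use the same weights, swapping the two inputs simply swaps the two outputs, so the sketch has no notion of a "first" or "second" performer. This is one plausible reading of how shared weights address the symmetry of human identities; the paper's actual architecture may differ in detail.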