We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic and coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.
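To make the idea of contact-based guidance at inference concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes the model has already predicted, for each body joint, a distance to the object surface and a confidence weight, and it nudges a single object translation by gradient descent so that the actual joint-to-object distances agree with those predictions. All names here (`contact_residuals`, `guide_object_translation`, `predicted_dist`, `weights`) are hypothetical, and the choice of a squared-distance objective over a rigid translation is an assumption made only for illustration.

```python
import numpy as np


def contact_residuals(joints, object_points, translation, predicted_dist):
    """Return (actual - predicted) joint-to-object distances, plus helper terms.

    joints:         (J, 3) human joint positions (hypothetical input)
    object_points:  (P, 3) points sampled on the object (hypothetical input)
    translation:    (3,)   current object translation being optimized
    predicted_dist: (J,)   predicted contact distance per joint (assumed given)
    """
    moved = object_points + translation                 # (P, 3)
    diff = joints[:, None, :] - moved[None, :, :]       # (J, P, 3)
    dist = np.linalg.norm(diff, axis=-1)                # (J, P)
    nearest = dist.argmin(axis=1)                       # closest object point per joint
    rows = np.arange(len(joints))
    vec = diff[rows, nearest]                           # (J, 3) joint minus nearest point
    d = dist[rows, nearest]                             # (J,)
    return d - predicted_dist, vec, d


def guide_object_translation(joints, object_points, predicted_dist, weights,
                             steps=100, lr=0.1):
    """Gradient descent on sum_j w_j * (dist_j - predicted_dist_j)^2 w.r.t. translation."""
    t = np.zeros(3)
    for _ in range(steps):
        res, vec, d = contact_residuals(joints, object_points, t, predicted_dist)
        unit = vec / np.maximum(d, 1e-8)[:, None]       # direction from object to joint
        # d(dist)/d(t) = -unit, so the gradient of the weighted squared residual is:
        grad = -2.0 * np.sum((weights * res)[:, None] * unit, axis=0)
        t -= lr * grad
    return t


if __name__ == "__main__":
    joints = np.array([[0.0, 0.0, 0.0]])                # one "hand" joint at the origin
    obj = np.array([[1.0, 0.0, 0.0], [1.5, 0.0, 0.0]])  # object initially 1 unit away
    pred = np.array([0.0])                              # predicted: joint touches object
    w = np.array([1.0])                                 # full confidence in that contact
    t = guide_object_translation(joints, obj, pred, w)
    print("translation applied to object:", t)
```

In this toy setting the object is pulled toward the joint until the gap matches the predicted contact distance. A guidance term of this kind would typically be applied to intermediate predictions during the diffusion sampling steps rather than once at the end, but how and where the guidance is applied in CG-HOI is not specified in the text above.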