We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation, without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic, coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.