Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, limiting existing methods to the narrow diversity of object types and interaction patterns present in training datasets. This paper proposes a novel zero-shot HOI synthesis framework that does not rely on end-to-end training on currently limited 3D HOI datasets. The core idea of our method lies in leveraging extensive HOI knowledge from pre-trained multimodal models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then lifted to 3D HOI milestones consisting of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain object poses from the 2D HOI images. Our estimation method adapts to various object templates obtained from text-to-3D models or online retrieval. Physics-based tracking of the 3D HOI kinematic milestones is further applied to refine both body motions and object poses, yielding more physically plausible HOI results. Experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.