We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we adopt a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) that generates both human and object motions conditioned on the input text, and we encourage coherent motions through a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) that predicts the contact area between the human and the object during the interaction described by the textual prompt. The APDM is independent of the output of the HOI-DM and can therefore correct potential errors made by the latter. Moreover, it generates the contact points stochastically, which diversifies the generated motions. Finally, we incorporate the estimated contact points into classifier guidance to achieve accurate and close contact between humans and objects. To train and evaluate our approach, we annotate the BEHAVE dataset with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate that our approach produces realistic HOIs with various interactions and different types of objects.
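To make the contact-guidance idea concrete, the following is a minimal toy sketch, not the implementation used in this work. It assumes a single predicted human contact point and a single object contact point, and uses a squared-distance loss whose gradient nudges both points toward each other, in the spirit of classifier guidance. The names `contact_loss`, `contact_grad`, `guided_update`, and the `scale` parameter are hypothetical and introduced only for illustration. In an actual diffusion sampler, such a gradient would be applied to the predicted mean at each denoising step rather than to raw points.

```python
from typing import Tuple

Vec = Tuple[float, float, float]


def contact_loss(human_pt: Vec, object_pt: Vec) -> float:
    """Squared Euclidean distance between the two contact points (assumed loss)."""
    return sum((h - o) ** 2 for h, o in zip(human_pt, object_pt))


def contact_grad(human_pt: Vec, object_pt: Vec) -> Tuple[Vec, Vec]:
    """Analytic gradient of contact_loss with respect to each point."""
    g_h = tuple(2.0 * (h - o) for h, o in zip(human_pt, object_pt))
    g_o = tuple(-g for g in g_h)  # the object gradient is the exact opposite
    return g_h, g_o


def guided_update(human_pt: Vec, object_pt: Vec, scale: float = 0.1) -> Tuple[Vec, Vec]:
    """One guidance step: move both points down the gradient of the contact loss.

    `scale` plays the role of a guidance strength (a hypothetical parameter).
    """
    g_h, g_o = contact_grad(human_pt, object_pt)
    new_h = tuple(h - scale * g for h, g in zip(human_pt, g_h))
    new_o = tuple(o - scale * g for o, g in zip(object_pt, g_o))
    return new_h, new_o
```

For example, starting from a hand point at the origin and an object point one unit away along the x-axis, repeated calls to `guided_update` shrink the gap between the two points at every step, which is the behavior the guidance term is meant to encourage.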