We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. Rather than relying on a single model, our key insight is to adopt a modular design that decomposes the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) that generates both human and object motions conditioned on the input text, and we encourage coherent motions through a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) that predicts the contact area between the human and the object during the interaction described by the textual prompt. The APDM is independent of the HOI-DM's outputs and can therefore correct potential errors made by the latter. Moreover, it generates the contact points stochastically, which diversifies the generated motions. Finally, we incorporate the estimated contact points into classifier guidance to achieve accurate and close contact between humans and objects. To train and evaluate our approach, we annotate the BEHAVE dataset with text descriptions. Experimental results demonstrate that our approach produces realistic HOIs across various interactions and different types of objects.
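To make the final step more concrete, the following is a minimal sketch of how estimated contact points could steer a denoising update through a guidance gradient. It is an illustration under stated assumptions, not the paper's implementation: the squared-distance contact loss, the function names `contact_loss` and `contact_guidance_step`, the `scale` parameter, and the paired point layout of shape (K, 3) are all hypothetical, and a real sampler would apply the gradient to the predicted mean at each diffusion step.

```python
import numpy as np

def contact_loss(human_pts, object_pts):
    """Sum of squared distances between paired contact points, each of shape (K, 3)."""
    diff = human_pts - object_pts
    return float(np.sum(diff ** 2))

def contact_guidance_step(human_pts, object_pts, scale=0.1):
    """One guidance update (assumed form): move both point sets down the
    gradient of the contact loss so that paired points get closer."""
    diff = human_pts - object_pts
    grad_human = 2.0 * diff    # gradient of the loss w.r.t. human_pts
    grad_object = -2.0 * diff  # gradient of the loss w.r.t. object_pts
    return human_pts - scale * grad_human, object_pts - scale * grad_object

# Toy data standing in for predicted human and object contact points.
rng = np.random.default_rng(0)
human = rng.normal(size=(2, 3))
obj = rng.normal(size=(2, 3))

before = contact_loss(human, obj)
human, obj = contact_guidance_step(human, obj)
after = contact_loss(human, obj)
```

With this loss, each step shrinks the gap between paired points by a factor of `1 - 4 * scale`, so the guidance strength trades off contact accuracy against fidelity to the unguided motion.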