In this study, we tackle the complex task of generating 3D human-object interactions (HOI) from textual descriptions in a zero-shot text-to-3D manner. We identify and address two key challenges: the unsatisfactory results of direct text-to-3D methods on HOI, largely due to the lack of paired text-interaction data, and the inherent difficulty of simultaneously generating multiple concepts with complex spatial relationships. To address these issues, we present InterFusion, a two-stage framework specifically designed for HOI generation. InterFusion uses human pose estimates derived from text as geometric priors, which simplify the text-to-3D conversion and impose additional constraints for accurate object generation. In the first stage, InterFusion extracts 3D human poses from a synthesized image dataset depicting a wide range of interactions and maps these poses to interaction descriptions. The second stage builds on recent advances in text-to-3D generation to produce realistic, high-quality 3D HOI scenes. This is achieved through a local-global optimization process: the human body and the object are first optimized separately, then jointly refined by a global optimization over the entire scene, ensuring seamless and contextually coherent integration. Our experimental results confirm that InterFusion significantly outperforms existing state-of-the-art methods in 3D HOI generation.
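The local-global schedule described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names (`sds_loss_grad`, `optimize`), the prompts, and the use of plain NumPy vectors as stand-ins for the 3D human and object representations are hypothetical and do not reflect InterFusion's actual implementation, which distills gradients from a text-to-image diffusion model.

```python
# Hedged sketch of a local-global optimization loop: local steps update the
# human and object under their own prompts, and a global step jointly refines
# the composed scene under the full interaction prompt. The "gradient" below
# is a toy placeholder, not a real score-distillation gradient.
import numpy as np

def sds_loss_grad(params, prompt):
    # Placeholder gradient: pull parameters toward a prompt-dependent target.
    target = np.full_like(params, float(len(prompt)))
    return params - target

def optimize(human, obj, n_iters=100, lr=0.05):
    human = human.astype(float).copy()
    obj = obj.astype(float).copy()
    for _ in range(n_iters):
        # Local steps: optimize each part against its own description.
        human -= lr * sds_loss_grad(human, "a person")
        obj -= lr * sds_loss_grad(obj, "a chair")
        # Global step: compose the scene and refine both parts jointly
        # under the full interaction description.
        scene = np.concatenate([human, obj])
        g = sds_loss_grad(scene, "a person sitting on a chair")
        human -= lr * g[: human.size]
        obj -= lr * g[human.size:]
    return human, obj
```

The alternation mirrors the abstract's description: per-part optimization keeps each concept well-formed, while the global step couples them so their spatial relationship stays consistent with the interaction text.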