Naturally controllable human-scene interaction (HSI) generation plays an important role in fields such as VR/AR content creation and human-centered AI. However, existing methods offer unnatural and unintuitive control, which heavily limits their practical application. We therefore focus on the challenging task of naturally and controllably generating realistic and diverse HSIs from textual descriptions. From the perspective of human cognition, an ideal generative model should correctly reason about spatial relationships and interactive actions. To that end, we propose Narrator, a novel relationship reasoning-based generative approach that uses a conditional variational autoencoder for naturally controllable generation given a 3D scene and a textual description. We model global and local spatial relationships in the 3D scene and the textual description, respectively, based on the scene graph, and introduce a part-level action mechanism that represents interactions as atomic body part states. Benefiting from this relationship reasoning, we further propose a simple yet effective multi-human generation strategy, which is the first exploration of controllable multi-human scene interaction generation. Extensive experiments and perceptual studies show that Narrator can controllably generate diverse interactions and significantly outperforms existing works. The code and dataset will be made available for research purposes.
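To make the idea of representing an interaction as atomic body part states more concrete, the following is a minimal sketch, not Narrator's actual implementation. The body-part vocabulary `BODY_PARTS`, the `PartLevelAction` class, and the `make_action` helper are all hypothetical names introduced here for illustration; the abstract does not specify the part list or the encoding.

```python
from dataclasses import dataclass

# Hypothetical body-part vocabulary (an assumption; the abstract does not list the parts).
BODY_PARTS = ("head", "back", "left_arm", "right_arm", "hips", "left_leg", "right_leg")


@dataclass(frozen=True)
class PartLevelAction:
    """An interaction described by the body parts that take part in it."""
    verb: str
    target: str
    active_parts: frozenset

    def to_vector(self):
        """Encode the action as one 0/1 entry per body part, in BODY_PARTS order."""
        return [1 if part in self.active_parts else 0 for part in BODY_PARTS]


def make_action(verb, target, parts):
    """Build a PartLevelAction, rejecting parts outside the assumed vocabulary."""
    unknown = set(parts) - set(BODY_PARTS)
    if unknown:
        raise ValueError(f"unknown body parts: {sorted(unknown)}")
    return PartLevelAction(verb, target, frozenset(parts))


# Illustrative examples: which parts are "active" is our assumption, not data from the paper.
sit = make_action("sit", "chair", ["hips", "back"])
touch = make_action("touch", "table", ["right_arm"])

print(sit.to_vector())
print(touch.to_vector())
```

Under this assumed encoding, a compound description such as "sit on the chair and touch the table" could be formed by taking the union of the two active-part sets, which is one way a text condition might be decomposed into per-part states.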