We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes, owing to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Upon publication of this work, code will be released at https://research.nvidia.com/labs/toronto-ai/tesmo.