Synthesizing Diverse Human Motions in 3D Indoor Scenes

We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner. Existing approaches rely on training sequences that contain captured human motions and the 3D scenes they interact with. However, such interaction data are costly, difficult to capture, and can hardly cover all plausible human-scene interactions in complex environments. To address these challenges, we propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously, driven by learned motion control policies. The motion control policies employ latent motion action spaces, which correspond to realistic motion primitives and are learned from large-scale motion capture data using a powerful generative motion model. For navigation in a 3D environment, we propose a scene-aware policy with novel state and reward designs for collision avoidance. Combined with navigation mesh-based path-finding algorithms to generate intermediate waypoints, our approach enables the synthesis of diverse human motions navigating in 3D indoor scenes and avoiding obstacles. To generate fine-grained human-object interactions, we carefully curate interaction goal guidance using a marker-based body representation and leverage features based on the signed distance field (SDF) to encode human-scene proximity relations. Our method can synthesize realistic and diverse human-object interactions (e.g., sitting on a chair and then getting up) even for out-of-distribution test scenarios with different object shapes, orientations, starting body positions, and poses. Experimental results demonstrate that our approach outperforms state-of-the-art methods in terms of both motion naturalness and diversity. Code and video results are available at: https://zkf1997.github.io/DIMOS.
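As an illustration of how SDF-based features can encode human-scene proximity at body markers, the following is a minimal sketch and not the implementation used in this work. It assumes the object is approximated by an axis-aligned box with an analytic signed distance function, and that the body is given as a list of 3D marker positions. The function names (`box_sdf`, `marker_sdf_features`, `penetration_penalty`) and the example values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def box_sdf(p: Point, center: Point, half_extents: Point) -> float:
    """Signed distance from point p to an axis-aligned box.

    Positive outside the box, zero on its surface, negative inside.
    The box is a stand-in (assumption) for a real scene SDF.
    """
    # Per-axis distance from the point to the box faces.
    q = [abs(p[i] - center[i]) - half_extents[i] for i in range(3)]
    # Distance to the surface when the point lies outside the box.
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    # Negative depth when the point lies inside the box.
    inside = min(max(q), 0.0)
    return outside + inside


def marker_sdf_features(
    markers: Sequence[Point], center: Point, half_extents: Point
) -> List[float]:
    """One signed distance per body marker (hypothetical feature vector)."""
    return [box_sdf(m, center, half_extents) for m in markers]


def penetration_penalty(sdf_values: Sequence[float]) -> float:
    """Sum of penetration depths over markers that lie inside the object."""
    return sum(-d for d in sdf_values if d < 0.0)


if __name__ == "__main__":
    # Hypothetical markers: one outside, one slightly inside, one at the center.
    markers = [(2.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]
    features = marker_sdf_features(markers, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
    print(features, penetration_penalty(features))
```

In a sketch like this, the per-marker distances could be fed to a policy as part of its state, while the penetration term could contribute to a collision-related reward. Whether the actual method uses these exact quantities is not stated in the abstract.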
