Human-Scene Interaction (HSI) generation is a challenging task that is crucial for many downstream applications. A major obstacle is the limited scale of available data: high-quality captures that record humans and their 3D environments simultaneously are rare, which limits data diversity and complexity. In this work, we argue that, from an abstract physical perspective, interacting with a scene is essentially interacting with the space occupancy of that scene. This leads us to a novel unified view, Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: Motion Occupancy Base (MOB). The need for costly paired motion-scene datasets with high-quality scene scans is thus substantially alleviated. Under this unified view, we propose a single motion controller that reaches a target state given the surrounding occupancy. Once trained on MOB, which contains complex occupancy layouts, the controller can handle cramped scenes and generalizes well to ordinary scenes of limited complexity. Without any ground-truth 3D scenes for training, our method generates realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. Our code and data will be made publicly available at https://foruck.github.io/occu-page/.
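The idea of reading a motion-only sequence as a record of interaction with invisible occupancy can be illustrated with a minimal sketch. The abstract does not describe how MOB is actually constructed, so everything below is an assumption made for illustration: the function names, the `(T, J, 3)` joint layout, the voxel grid parameters, and the random filling of untouched voxels are all hypothetical. The sketch only shows the underlying observation that any voxel the body passes through must have been free, so occupancy can only be hypothesized elsewhere.

```python
import numpy as np

def free_space_from_motion(joints, grid_min, voxel_size, grid_shape):
    """Mark every voxel that any joint passes through as known-free.

    joints: (T, J, 3) array of joint positions (hypothetical layout).
    Returns a boolean grid where True means "the body was here, so it is free".
    """
    free = np.zeros(grid_shape, dtype=bool)
    # Convert continuous positions to integer voxel indices.
    idx = np.floor((joints.reshape(-1, 3) - grid_min) / voxel_size).astype(int)
    # Discard joints that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]
    free[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return free

def sample_occupancy(free, fill_prob, rng):
    """Randomly fill voxels the motion never touched (an illustrative
    assumption, not the paper's procedure); swept voxels stay empty."""
    candidate = rng.random(free.shape) < fill_prob
    return candidate & ~free

# Toy sequence: one joint observed at two time steps.
joints = np.array([[[0.05, 0.05, 0.05]], [[0.25, 0.05, 0.05]]])
free = free_space_from_motion(joints, np.zeros(3), 0.1, (4, 4, 4))
occupancy = sample_occupancy(free, 0.5, np.random.default_rng(0))
```

Each motion sequence paired with an occupancy grid that is consistent with it then forms one human-occupancy sample, without requiring a scanned scene.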