Human-Scene Interaction (HSI) generation is a challenging task that is crucial for various downstream applications. However, one of its major obstacles is limited data scale: high-quality data that simultaneously captures humans and their 3D environments is hard to acquire, resulting in limited data diversity and complexity. In this work, we argue that, from an abstract physical perspective, interacting with a scene is essentially interacting with the scene's space occupancy, which leads us to a unified novel view of Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: the Motion Occupancy Base (MOB). The need for costly paired motion-scene datasets with high-quality scene scans can thus be substantially alleviated. With this unified view of human-occupancy interaction, we propose a single motion controller that reaches a target state given the surrounding occupancy. Once trained on MOB, whose complex occupancy layouts impose stringent constraints on human movement, the controller can handle cramped scenes and generalizes well to scenes of limited complexity, such as regular living rooms. Without any ground-truth 3D scenes for training, our method generates realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. The project page is available at https://foruck.github.io/occu-page/.
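The core idea of treating motion-only data as records of interaction with invisible occupancy can be illustrated with a minimal sketch: voxels the body ever passes through must be free space, and the remaining voxels in the surrounding volume become candidate scene occupancy. The function below is a hypothetical illustration of this carving step, not the paper's actual MOB pipeline; the voxel size, padding, and clearance parameters are assumptions for the example.

```python
import numpy as np

def motion_to_occupancy(joints, voxel_size=0.1, pad=0.5, clearance=0.1):
    """Carve a candidate occupancy grid from a motion sequence.

    joints: (T, J, 3) array of joint positions over T frames.
    Voxels the body passes through (plus a small clearance) must be
    free space; every other voxel in the padded bounding box is
    treated as candidate scene occupancy.
    """
    pts = joints.reshape(-1, 3)
    lo = pts.min(axis=0) - pad
    dims = np.ceil((pts.max(axis=0) + pad - lo) / voxel_size).astype(int)
    free = np.zeros(dims, dtype=bool)
    idx = np.floor((pts - lo) / voxel_size).astype(int)
    # Mark a small Chebyshev neighborhood around each body voxel as free,
    # giving the body a clearance margin.
    r = int(np.ceil(clearance / voxel_size))
    offs = np.stack(np.meshgrid(*[np.arange(-r, r + 1)] * 3,
                                indexing="ij"), axis=-1).reshape(-1, 3)
    nbr = np.clip((idx[:, None, :] + offs[None, :, :]).reshape(-1, 3),
                  0, dims - 1)
    free[nbr[:, 0], nbr[:, 1], nbr[:, 2]] = True
    # Everything never swept by the body is candidate occupancy.
    return ~free, lo  # occupancy grid and its world-space origin
```

In a real pipeline the body would be a full surface mesh rather than sparse joints, and the candidate occupancy would be further shaped into plausible layouts; this sketch only shows the abstract free-space-carving view.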