Real-world scenes continuously undergo dynamic changes. However, existing human-scene interaction generation methods typically treat the scene as static, which deviates from reality. Inspired by world models, we introduce Dyn-HSI, the first cognitive architecture for dynamic human-scene interaction, which endows virtual humans with three humanoid components. (1) Vision (human eyes): we equip the virtual human with Dynamic Scene-Aware Navigation, which continuously perceives changes in the surrounding environment and adaptively predicts the next waypoint. (2) Memory (human brain): we equip the virtual human with a Hierarchical Experience Memory, which stores and updates experiential data accumulated during training. This allows the model to leverage prior knowledge during inference for context-aware motion priming, thereby enhancing both motion quality and generalization. (3) Control (human body): we equip the virtual human with a Human-Scene Interaction Diffusion Model, which generates high-fidelity interaction motions conditioned on multimodal inputs. To evaluate performance in dynamic scenes, we extend existing static human-scene interaction datasets to construct a dynamic benchmark, Dyn-Scenes. We conduct extensive qualitative and quantitative experiments to validate Dyn-HSI, showing that our method consistently outperforms existing approaches and generates high-quality human-scene interaction motions in both static and dynamic settings.