Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and rely on privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA, a model for general object rearrangement directed by practical vision and language inputs. HumanVLA is developed within a teacher-student framework: a state-based teacher policy is first trained using goal-conditioned reinforcement learning and an adversarial motion prior, and is then distilled into a vision-language-action model via behavior cloning. We propose several key insights that facilitate this large-scale learning process. To support general object rearrangement by a physical humanoid, we introduce the novel Human-in-the-Room dataset, which encompasses a variety of rearrangement tasks. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed approach.
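The teacher-student distillation described above can be sketched as behavior cloning: the student is regressed onto the teacher's actions over sampled observations. The following is a minimal illustrative example, assuming linear policies and synthetic data; the dimensions, models, and training loop are placeholders, not the actual HumanVLA architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM = 8, 4  # illustrative sizes, not from the paper

# "Teacher": a fixed state-based policy, standing in for the policy
# trained with goal-conditioned RL and the adversarial motion prior.
W_teacher = rng.normal(size=(STATE_DIM, ACT_DIM))

def teacher_policy(states):
    return states @ W_teacher

# "Student": initialized untrained. In HumanVLA the student consumes
# vision-language observations; here it sees the raw state as a proxy.
W_student = np.zeros((STATE_DIM, ACT_DIM))

# Behavior cloning: minimize squared error between student and teacher
# actions on sampled states via gradient descent.
lr = 0.05
for _ in range(500):
    states = rng.normal(size=(64, STATE_DIM))
    target = teacher_policy(states)
    pred = states @ W_student
    grad = states.T @ (pred - target) / len(states)
    W_student -= lr * grad

# After distillation the student closely imitates the teacher.
err = np.abs(W_student - W_teacher).max()
print(err < 1e-2)
```

Because the supervision signal here is noiseless and linear, the student recovers the teacher exactly in the limit; in the actual pipeline the student instead learns from rendered observations and language goals, so imitation is only approximate.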