Traditional approaches in physics-based motion generation, centered around imitation learning and reward shaping, often struggle to adapt to new scenarios. To tackle this limitation, we propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach begins by developing a set of atomic actions via a low-level controller trained with imitation learning. Upon receiving an open-vocabulary textual instruction, AnySkill employs a high-level policy that selects and integrates these atomic actions to maximize the CLIP similarity between the agent's rendered images and the text. An important feature of our method is the use of image-based rewards for the high-level policy, which allows the agent to learn interactions with objects without manual reward engineering. We demonstrate AnySkill's capability to generate realistic and natural motion sequences in response to unseen instructions of varying lengths, making it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.
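To make the image-based reward concrete, the sketch below (not the authors' implementation) scores a rendered frame against an open-vocabulary instruction with an off-the-shelf CLIP model; the checkpoint name, the `clip_reward` helper, and the `env.render()` call are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of a CLIP-similarity reward for a high-level policy.
# Assumptions: openai/clip-vit-base-patch32 as the CLIP backbone, and an
# environment whose renderer returns a PIL image (both hypothetical here).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(frame: Image.Image, instruction: str) -> float:
    """Cosine similarity between CLIP embeddings of a rendered frame and the text."""
    inputs = processor(text=[instruction], images=frame,
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()

# Usage (hypothetical environment): reward the high-level policy with the
# similarity of the current frame to the instruction at each step.
# frame = env.render()                      # PIL image of the agent
# r_t = clip_reward(frame, "kick the ball") # dense, engineering-free reward
```

Because the reward is computed from rendered pixels rather than hand-specified state features, the same scoring function applies unchanged to any new instruction or object, which is what removes the need for per-task reward engineering.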