Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement style and physical capability. Existing methods for human-robot translation either depend on paired data, which is infeasible to collect at scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach enables effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving an over 50% increase in task success compared to previous methods. We release our code and datasets at https://portal-cornell.github.io/rhyme/.
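To make the sequence-level matching idea concrete, the sketch below computes an entropic-regularized optimal transport cost between two embedding sequences (e.g., a robot trajectory and a candidate human clip) via Sinkhorn iterations. This is a minimal, illustrative implementation assuming cosine ground costs and uniform marginals; the embedding model, cost function, and solver used in RHyME itself may differ.

```python
import numpy as np

def sinkhorn_ot_cost(X, Y, eps=0.1, n_iters=200):
    """Entropic-regularized OT cost between two embedding sequences.

    X: (m, d) array of per-frame embeddings for one trajectory.
    Y: (n, d) array of per-frame embeddings for another.
    Returns a scalar transport cost; lower means the sequences are
    more semantically alike. (Illustrative sketch, not RHyME's
    exact cost function.)
    """
    # Cosine distance as the ground cost matrix.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                      # (m, n)

    m, n = C.shape
    a = np.full(m, 1.0 / m)                  # uniform marginal over X frames
    b = np.full(n, 1.0 / n)                  # uniform marginal over Y frames
    K = np.exp(-C / eps)                     # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):                 # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)          # approximate transport plan
    return float((P * C).sum())
```

A retrieval step could then rank short-horizon human clips by this cost against each robot segment and compose the lowest-cost matches into a synthetic long-horizon human video.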