Human demonstrations as prompts are a powerful way to program robots to perform long-horizon manipulation tasks. However, translating such demonstrations directly into robot-executable actions poses significant challenges due to execution mismatches, such as differing movement styles and physical capabilities. Existing methods either rely on robot-demonstrator paired data, which is infeasible to collect at scale, or lean heavily on frame-level visual similarities, which often fail to hold. To address these challenges, we propose RHyME, a novel framework that automatically establishes task-execution correspondences between the robot and the demonstrator via optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human demonstrations by retrieving and composing similar short-horizon human clips, enabling effective policy training without paired data. We show that RHyME outperforms a range of baselines across cross-embodiment datasets at all degrees of mismatch. Through detailed analysis, we distill insights for learning and leveraging cross-embodiment visual representations.
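To make the retrieval idea concrete, the following is a minimal, illustrative sketch of optimal-transport-based clip retrieval: given visual features of a robot segment, each candidate human clip is scored by an entropy-regularized (Sinkhorn) transport cost between the two feature sequences, and the lowest-cost clip is retrieved. All function names, the cosine cost, uniform marginals, and hyperparameters here are assumptions for illustration; they are not the paper's exact formulation.

```python
import numpy as np

def sinkhorn_cost(x, y, reg=0.1, n_iters=100):
    """Entropy-regularized optimal transport cost between two feature
    sequences x (m, d) and y (n, d), using pairwise cosine distance.
    Illustrative only -- RHyME's exact cost may differ."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    C = 1.0 - x @ y.T                      # pairwise cosine distances
    K = np.exp(-C / reg)                   # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))      # uniform marginal over robot frames
    b = np.full(len(y), 1.0 / len(y))      # uniform marginal over human frames
    u = np.ones_like(a)
    for _ in range(n_iters):               # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]        # transport plan
    return float((P * C).sum())            # transport cost <P, C>

def retrieve_clip(robot_feats, human_clips):
    """Return the index of the human clip whose features have the
    lowest OT cost to the robot segment's features."""
    costs = [sinkhorn_cost(robot_feats, clip) for clip in human_clips]
    return int(np.argmin(costs))
```

Because the transport cost compares whole feature sequences rather than individual frames, it tolerates differences in speed and style between demonstrator and robot, which is exactly why frame-level similarity alone is insufficient.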