End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and by the heterogeneity of action spaces across robot embodiments. In particular, the diverse action spaces of different end-effectors create barriers to cross-embodiment learning and skill transfer. We address this challenge with diffusion policies learned in a latent action space that unifies diverse end-effector actions. First, we show that encoders trained with a contrastive loss can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel-jaw gripper. Second, we show that co-training on manipulation data from different end-effectors in this latent action space lets a single policy control multiple robots and improves manipulation success rates by up to 25.3%, indicating successful skill transfer despite a significant embodiment gap. Our latent cross-embodiment policies offer a new way to unify action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation substantially reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.
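The abstract does not state which contrastive objective aligns the per-embodiment encoders. As a minimal sketch, assuming a CLIP-style symmetric InfoNCE loss, the idea is that paired action embeddings from two end-effectors (row i of each batch encoding the same semantic action) are pulled together while all other pairs in the batch are pushed apart. The function name `info_nce`, the temperature of 0.1, and the random stand-ins `z_hand` and `z_gripper` for encoder outputs are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Assumption (hypothetical): row i of z_a and row i of z_b encode
    semantically corresponding actions from two different end-effectors.
    """
    # L2-normalize so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(z_a))        # positives lie on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        # numerically stable row-wise log-softmax
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # average the a->b and b->a directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


# Hypothetical stand-ins for encoder outputs of a hand and a gripper
rng = np.random.default_rng(0)
z_hand = rng.normal(size=(8, 16))
z_gripper = z_hand + 0.05 * rng.normal(size=(8, 16))  # well-aligned pairs
z_gripper_misaligned = np.roll(z_gripper, 1, axis=0)  # pairs broken

aligned_loss = info_nce(z_hand, z_gripper)
misaligned_loss = info_nce(z_hand, z_gripper_misaligned)
print(aligned_loss, misaligned_loss)
```

Aligned pairs yield a much lower loss than misaligned ones, which is the signal that drives encoders for different end-effectors toward a shared latent space. Using the symmetric form treats neither embodiment as the anchor; whether the paper uses this variant, or a different contrastive loss, is not specified in the abstract.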