Training a robotic policy from scratch with deep reinforcement learning can be prohibitively expensive because of sample inefficiency. Transferring a policy trained in a source domain to a target domain is therefore an attractive paradigm. Previous research has typically focused on domains that share similar state and action spaces but differ in other aspects. In this paper, we focus on domains with different state and action spaces, a setting with broader practical implications, e.g., transferring a policy from robot A to robot B. Unlike prior methods that rely on paired data, we propose a novel approach that learns the mapping functions between the state and action spaces of the two domains from unpaired data. Specifically, we propose effect cycle consistency, which aligns the effects of transitions across the two domains through a symmetrical optimization structure in order to learn these mapping functions. Once the mapping functions are learned, the policy can be transferred from the source domain to the target domain. We evaluate our approach on three locomotion tasks and two robotic manipulation tasks. The empirical results show that our method significantly reduces alignment errors and outperforms the state-of-the-art method.
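To make the final transfer step concrete, the following is a minimal sketch of how a source policy might be reused in the target domain once the mapping functions are available. It is not the paper's implementation: the function names (`transfer_policy`, `state_map`, `action_map`), the direction of each mapping (target state to source state, source action to target action), and the toy linear maps are all assumptions made for illustration. In the paper the mappings are learned from unpaired data; here they are hand-written stand-ins.

```python
from typing import Callable, List

Vector = List[float]


def transfer_policy(
    source_policy: Callable[[Vector], Vector],
    state_map: Callable[[Vector], Vector],
    action_map: Callable[[Vector], Vector],
) -> Callable[[Vector], Vector]:
    """Compose a target-domain policy from a source policy and two mappings.

    state_map  : target state  -> source state  (assumed direction)
    action_map : source action -> target action (assumed direction)
    """

    def target_policy(target_state: Vector) -> Vector:
        source_state = state_map(target_state)        # express the state in source terms
        source_action = source_policy(source_state)   # query the unchanged source policy
        return action_map(source_action)              # express the action in target terms

    return target_policy


# Toy stand-ins (hypothetical): the source domain has a 2-D state and a
# 1-D action, while the target domain has a 3-D state and a 2-D action.
def toy_source_policy(state: Vector) -> Vector:
    return [-(state[0] + state[1])]


def toy_state_map(target_state: Vector) -> Vector:
    return [target_state[0], target_state[1] + target_state[2]]


def toy_action_map(source_action: Vector) -> Vector:
    return [0.5 * source_action[0], 0.5 * source_action[0]]


target_policy = transfer_policy(toy_source_policy, toy_state_map, toy_action_map)
action = target_policy([1.0, 2.0, 3.0])
print(action)
```

The source policy itself is never retrained in this sketch; only the two mappings connect the domains, which is why the quality of the learned mappings (the alignment error) determines how well the transferred policy behaves.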