We consider the problem of policy transfer between two Markov Decision Processes (MDPs). Building on existing theoretical results in reinforcement learning, we introduce a lemma that measures the relativity gap between two arbitrary MDPs, that is, the difference between any two cumulative expected returns defined on different policies and environment dynamics. Based on this lemma, we propose two new algorithms, Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modelling, respectively. RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms yields the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with both environments simultaneously, so that data collection from the two environments, policy updates, and transition updates are completed in one closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks, where policy transfer problems are created by varying the dynamics.
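To make the quantity behind the relativity gap concrete, the following is a minimal sketch that computes the difference between two cumulative expected returns exactly for small tabular MDPs. It is not the lemma or any of the RPO/RTO/RPTO algorithms from this work. The function names (`expected_return`, `return_gap`), the tabular setting, the discounted infinite-horizon objective, and the reward function shared by both environments are all assumptions made for illustration.

```python
import numpy as np


def expected_return(P, R, pi, mu, gamma=0.9):
    """Exact discounted expected return of a policy in a tabular MDP.

    Illustrative helper (hypothetical, not from the paper).
    P:  transitions, shape (S, A, S), P[s, a, t] = Pr(t | s, a)
    R:  rewards, shape (S, A)
    pi: policy, shape (S, A), pi[s, a] = Pr(a | s)
    mu: initial state distribution, shape (S,)
    """
    n_states = P.shape[0]
    # State-to-state transition matrix and expected reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    # Solve the Bellman equation V = r_pi + gamma * P_pi @ V for V.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return float(mu @ V)


def return_gap(P_src, pi_src, P_tgt, pi_tgt, R, mu, gamma=0.9):
    """Return in the target MDP under pi_tgt minus return in the source
    MDP under pi_src, assuming (for illustration) a shared reward R."""
    return (expected_return(P_tgt, R, pi_tgt, mu, gamma)
            - expected_return(P_src, R, pi_src, mu, gamma))


if __name__ == "__main__":
    # Two 2-state, 2-action environments that differ only in dynamics.
    P_src = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.5, 0.5], [0.1, 0.9]]])
    P_tgt = np.array([[[0.6, 0.4], [0.3, 0.7]],
                      [[0.5, 0.5], [0.4, 0.6]]])
    R = np.array([[0.0, 1.0],
                  [2.0, 0.0]])
    pi = np.full((2, 2), 0.5)   # uniform policy used in both environments
    mu = np.array([1.0, 0.0])   # always start in state 0
    print(return_gap(P_src, pi, P_tgt, pi, R, mu))
```

With identical dynamics and policies on both sides the gap is zero. In the continuous-control setting considered here the returns cannot be computed in closed form like this and would have to be estimated from sampled trajectories.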