We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma, built on existing theoretical results in reinforcement learning, that measures the relative gap between two arbitrary MDPs, that is, the difference between any two cumulative expected returns defined under different policies and environment dynamics. Based on this lemma, we propose two new algorithms, Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which provide fast policy transfer and dynamics modelling, respectively. RPO transfers a policy evaluated in one environment so that it maximizes the return in the other, while RTO updates a parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms yields the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with both environments simultaneously, so that data collection from the two environments, policy updates, and transition updates are completed in one closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by constructing policy transfer problems with varied dynamics.
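For concreteness, write $J_{\mathcal{M}}(\pi)$ for the cumulative expected return of policy $\pi$ in MDP $\mathcal{M}$ with transition kernel $P$. The gap the lemma measures, together with one standard way to split it (this decomposition is an illustration combining well-known results, not necessarily the paper's exact statement), can be written as:

\[
J_{\mathcal{M}}(\pi) = \mathbb{E}_{\tau \sim (\pi, P)}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
J_{\mathcal{M}'}(\pi') - J_{\mathcal{M}}(\pi)
= \underbrace{\big[J_{\mathcal{M}'}(\pi') - J_{\mathcal{M}'}(\pi)\big]}_{\text{policy gap}}
+ \underbrace{\big[J_{\mathcal{M}'}(\pi) - J_{\mathcal{M}}(\pi)\big]}_{\text{dynamics gap}}.
\]

The first term is the quantity controlled by the performance difference lemma of Kakade and Langford, while the second admits a telescoping simulation-lemma form,
$J_{\mathcal{M}'}(\pi) - J_{\mathcal{M}}(\pi) = \frac{\gamma}{1-\gamma}\,\mathbb{E}_{(s,a)\sim \rho^{\pi}_{P'}}\big[\mathbb{E}_{s'\sim P'(\cdot\mid s,a)} V^{\pi}_{\mathcal{M}}(s') - \mathbb{E}_{s'\sim P(\cdot\mid s,a)} V^{\pi}_{\mathcal{M}}(s')\big]$,
which is what a dynamics-model update can shrink.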
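As a minimal sketch of how the closed loop described above might be organized — assuming a hypothetical `rollout` helper, placeholder `rto_update` and `rpo_update` steps, and a generic environment interface, none of which are the paper's actual API:

```python
# Minimal sketch of the RPTO closed loop described in the abstract.
# All names here (rollout, rto_update, rpo_update, the env/policy interface)
# are illustrative placeholders, not the paper's implementation.

def rollout(env, policy, horizon=1000):
    """Collect one trajectory of (s, a, r, s') tuples from env under policy."""
    traj, s = [], env.reset()
    for _ in range(horizon):
        a = policy.act(s)
        s_next, r, done = env.step(a)
        traj.append((s, a, r, s_next))
        if done:
            break
        s = s_next
    return traj

def rto_update(dynamics_model, src_batch, tgt_batch):
    """Placeholder: fit the parameterized dynamics model so that source
    transitions match the target environment's transitions."""
    ...

def rpo_update(policy, src_batch, tgt_batch, dynamics_model):
    """Placeholder: policy update that transfers the policy evaluated in the
    source environment toward maximizing the target-environment return."""
    ...

def rpto(src_env, tgt_env, policy, dynamics_model, num_iters=1000):
    """One closed loop: collect from both MDPs, then take an RTO and an RPO step."""
    for _ in range(num_iters):
        src_batch = rollout(src_env, policy)   # data from the source MDP
        tgt_batch = rollout(tgt_env, policy)   # data from the target MDP
        rto_update(dynamics_model, src_batch, tgt_batch)
        rpo_update(policy, src_batch, tgt_batch, dynamics_model)
    return policy, dynamics_model
```

The structural point the abstract emphasizes is visible in the loop body: both environments are queried in the same iteration, so the policy update (RPO) and the dynamics-model update (RTO) always consume fresh data from both MDPs.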