Deep Reinforcement Learning (Deep RL) and Evolutionary Algorithms (EA) are two major paradigms of policy optimization with distinct learning principles: gradient-based vs. gradient-free. An appealing research direction is to integrate Deep RL and EA into new methods that fuse their complementary advantages. However, existing works that combine Deep RL and EA share two drawbacks: 1) the RL agent and the EA agents learn their policies individually, neglecting efficient sharing of useful common knowledge; 2) parameter-level policy optimization does not guarantee semantically meaningful behavior evolution on the EA side. In this paper, we propose Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re$^2$), a novel solution to these two drawbacks. The key idea of ERL-Re$^2$ is two-scale representation: all EA and RL policies share the same nonlinear state representation while maintaining individual linear policy representations. The state representation conveys expressive common features of the environment learned collectively by all the agents; the linear policy representation provides a favorable space for efficient policy optimization, in which novel behavior-level crossover and mutation operations can be performed. Moreover, the linear policy representation allows convenient generalization of policy fitness with the help of the Policy-extended Value Function Approximator (PeVFA), further improving the sample efficiency of fitness estimation. Experiments on a range of continuous control tasks show that ERL-Re$^2$ consistently outperforms advanced baselines and achieves state-of-the-art (SOTA) performance. Our code is available at https://github.com/yeshenpy/ERL-Re2.
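To make the two-scale idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single shared `tanh` layer as the nonlinear state representation and one weight matrix per policy as the linear policy representation, with one row per action dimension. The row-wise crossover and mutation shown here are one plausible reading of "behavior-level" operators; all names, sizes, and probabilities (`state_representation`, `new_linear_policy`, `swap_prob`, `noise_std`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; they are not taken from the paper.
STATE_DIM, FEAT_DIM, ACTION_DIM = 4, 8, 3

# Shared nonlinear state representation: a single tanh layer (assumption).
shared_W = rng.normal(scale=0.5, size=(FEAT_DIM, STATE_DIM))
shared_b = np.zeros(FEAT_DIM)


def state_representation(state):
    """Map a raw state to features shared by every RL and EA policy."""
    return np.tanh(shared_W @ state + shared_b)


def new_linear_policy():
    """Individual linear policy: one row per action dimension, last column is a bias."""
    return rng.normal(scale=0.1, size=(ACTION_DIM, FEAT_DIM + 1))


def act(policy, state):
    """Apply a linear policy on top of the shared features; actions lie in (-1, 1)."""
    features = np.append(state_representation(state), 1.0)
    return np.tanh(policy @ features)


def crossover(parent_a, parent_b, swap_prob=0.5):
    """Swap whole rows, so each child inherits complete per-action-dimension behaviors."""
    mask = rng.random(ACTION_DIM) < swap_prob
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[mask], child_b[mask] = parent_b[mask], parent_a[mask]
    return child_a, child_b


def mutate(policy, mutate_prob=0.5, noise_std=0.1):
    """Add Gaussian noise only to the rows of the selected action dimensions."""
    mask = rng.random(ACTION_DIM) < mutate_prob
    child = policy.copy()
    child[mask] += rng.normal(scale=noise_std, size=(int(mask.sum()), FEAT_DIM + 1))
    return child


# Usage: two population members act on the same state through the shared features.
parent_a, parent_b = new_linear_policy(), new_linear_policy()
state = rng.normal(size=STATE_DIM)
child_a, child_b = crossover(parent_a, parent_b)
mutant = mutate(child_a)
print(act(parent_a, state), act(mutant, state))
```

Because every policy reads the same features, the evolutionary operators only touch the small per-policy matrices. In this sketch, swapping or perturbing a row changes how one action dimension responds to the shared features and leaves the other dimensions untouched.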