We present a simple, sample-efficient algorithm that uses evolutionary operators to introduce large but directed learning steps in reinforcement learning (RL). The method trains a population of RL agents on a common experience buffer and occasionally applies crossover and mutation to the agents in order to search the policy space efficiently. Unlike prior literature on combining evolutionary search (ES) with RL, this work does not generate a distribution of agents from a common mean and covariance matrix, nor does it require evaluating the entire population of policies at every time step. Instead, we focus on gradient-based training throughout the life of every policy (individual), with only sparse evolutionary exploration. The resulting algorithm is shown to be robust to hyperparameter variations. As a surprising corollary, we show that simply initialising and training multiple RL agents with a common memory (with no further evolutionary updates) outperforms several standard RL baselines.
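The overall loop described above (a population trained by gradient steps on one shared buffer, with infrequent crossover and mutation) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the RL algorithm, the operators, or the selection rule, so everything here is an assumption made for illustration. In particular, the "gradient update" is a toy least-squares step standing in for an RL update, experience collection does not depend on the agent's policy, and the names `TRUE_W`, `collect`, `sgd_step`, `fitness`, `crossover`, `mutate`, `train`, and `evolve_every` are hypothetical.

```python
import random
from collections import deque

TRUE_W = [2.0, -1.0]  # hypothetical target weights for the toy task


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def collect(rng):
    """Stand-in for one environment interaction: returns an (input, target) pair."""
    x = [rng.uniform(-1.0, 1.0) for _ in TRUE_W]
    return x, predict(TRUE_W, x)


def sgd_step(weights, batch, lr):
    """One gradient step on mean squared error (stand-in for an RL update)."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = predict(weights, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return [w - lr * g for w, g in zip(weights, grad)]


def fitness(weights, batch):
    """Higher is better: negative mean squared error on the batch."""
    return -sum((predict(weights, x) - y) ** 2 for x, y in batch) / len(batch)


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each weight is copied from one parent at random."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(weights, sigma, rng):
    """Gaussian mutation: add zero-mean noise to every weight."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def train(pop_size=4, steps=300, evolve_every=50, seed=0):
    rng = random.Random(seed)
    buffer = deque(maxlen=1000)  # experience buffer shared by all agents
    population = [[rng.gauss(0.0, 1.0) for _ in TRUE_W] for _ in range(pop_size)]

    for step in range(1, steps + 1):
        # Every agent contributes one transition to the common buffer.
        for _ in population:
            buffer.append(collect(rng))

        # Gradient-based training for every individual, sampled from shared memory.
        for i, weights in enumerate(population):
            batch = rng.sample(list(buffer), min(16, len(buffer)))
            population[i] = sgd_step(weights, batch, lr=0.1)

        # Sparse evolutionary step (assumed rule): the worst agent is replaced
        # by a mutated crossover of the two best agents.
        if step % evolve_every == 0:
            eval_batch = rng.sample(list(buffer), min(64, len(buffer)))
            ranked = sorted(population, key=lambda w: fitness(w, eval_batch), reverse=True)
            ranked[-1] = mutate(crossover(ranked[0], ranked[1], rng), 0.05, rng)
            population = ranked

    final_batch = list(buffer)
    return max(population, key=lambda w: fitness(w, final_batch))


if __name__ == "__main__":
    print(train())
```

Setting `evolve_every` larger than `steps` disables the evolutionary step entirely, which corresponds to the shared-memory-only variant mentioned in the final sentence of the abstract.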