Evolution Strategy (ES) is a powerful black-box optimization technique inspired by natural evolution. A key step in each iteration is ranking candidate solutions by a fitness score. For an ES method in Reinforcement Learning (RL), this ranking step requires evaluating multiple policies. At present this is done with on-policy approaches: each policy's score is estimated by running that policy in the environment several times. Many of these interactions are wasted because, once the ranking is done, only the data associated with the top-ranked policies is used for subsequent learning. To improve sample efficiency, we propose a novel off-policy alternative for ranking, based on a local approximation of the fitness function. We demonstrate our idea in the context of a state-of-the-art ES method called Augmented Random Search (ARS). Simulations on MuJoCo tasks show that, compared to the original ARS, our off-policy variant takes a similar running time to reach reward thresholds but needs only around 70% as much data. It also outperforms the recent Trust Region ES. We believe our ideas should extend to other ES methods as well.
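To make the on-policy ranking step described above concrete, the following is a minimal sketch of one ARS-style iteration on a toy problem. It is an illustration under stated assumptions, not the authors' implementation: `rollout_return` is a hypothetical stand-in for an environment episode (a noisy quadratic score), and the function name `ars_step`, the hyperparameter names, and their default values are chosen here for illustration only. The sketch counts how many rollouts are run and how many actually feed the update.

```python
import random
import statistics


def rollout_return(theta, rng):
    """Hypothetical stand-in for one environment episode: a noisy quadratic score."""
    target = [1.0, -2.0, 0.5]  # assumed optimum of the toy fitness function
    score = -sum((t - g) ** 2 for t, g in zip(theta, target))
    return score + rng.gauss(0.0, 0.01)


def ars_step(theta, rng, n_dirs=8, top_b=4, nu=0.1, alpha=0.05):
    """One ARS-style iteration with on-policy ranking of perturbed policies."""
    dim = len(theta)
    candidates = []
    for _ in range(n_dirs):
        delta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [t + nu * d for t, d in zip(theta, delta)]
        minus = [t - nu * d for t, d in zip(theta, delta)]
        # On-policy evaluation: every perturbed policy costs its own rollout.
        r_plus = rollout_return(plus, rng)
        r_minus = rollout_return(minus, rng)
        candidates.append((delta, r_plus, r_minus))

    # Ranking step: order directions by the better of their two returns.
    candidates.sort(key=lambda c: max(c[1], c[2]), reverse=True)
    kept = candidates[:top_b]

    # Only the top-ranked directions contribute to the parameter update.
    kept_returns = [r for _, rp, rm in kept for r in (rp, rm)]
    sigma = statistics.pstdev(kept_returns) or 1.0  # guard against zero spread
    new_theta = list(theta)
    for delta, rp, rm in kept:
        for i in range(dim):
            new_theta[i] += alpha / (top_b * sigma) * (rp - rm) * delta[i]

    rollouts_run = 2 * n_dirs   # interactions spent on ranking
    rollouts_used = 2 * top_b   # interactions whose data reach the update
    return new_theta, rollouts_run, rollouts_used


if __name__ == "__main__":
    rng = random.Random(0)
    theta, run, used = ars_step([0.0, 0.0, 0.0], rng)
    print(f"rollouts run: {run}, rollouts used in update: {used}")
```

With these assumed defaults, half of the rollouts serve only to produce the ranking and are then discarded. This is the waste that an off-policy ranking scheme aims to reduce; the off-policy estimator itself is not sketched here because the abstract does not specify it.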