Evolution Strategies (ES) are effective gradient-free optimization methods that can be competitive with gradient-based approaches for policy search. ES rely only on the total episodic scores of the solutions in their population, from which they estimate fitness gradients for their update without access to true gradient information. However, this makes them sensitive to deceptive fitness landscapes, and they tend to explore only one way of solving a problem. Quality-Diversity (QD) methods such as MAP-Elites introduce additional information in the form of behavior descriptors (BD) to return a population of diverse solutions. This helps exploration, but it means that a large part of the evaluation budget is not focused on finding the best-performing solution. Here we show that behavior information can also be leveraged to find the best policy, by identifying promising search areas that can then be explored efficiently with ES. We introduce the framework of Quality with Just Enough Diversity (JEDi), which learns the relationship between behavior and fitness in order to focus evaluations on the solutions that matter. When the goal is to reach higher fitness values, JEDi outperforms both QD and ES methods on hard exploration tasks such as mazes and on complex control problems with large policies.
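To make the score-only update concrete, the following is a minimal sketch of a generic ES gradient estimate with antithetic (mirrored) sampling. It is not the JEDi algorithm and does not use behavior descriptors. The function names (`es_gradient_estimate`, `es_optimize`, `toy_fitness`), the hyperparameter values, and the quadratic toy objective standing in for an episodic score are all assumptions made for illustration.

```python
import random


def es_gradient_estimate(fitness, theta, sigma=0.1, population=200, rng=None):
    """Estimate the fitness gradient at theta using only scalar scores."""
    if rng is None:
        rng = random.Random(0)
    dim = len(theta)
    pairs = population // 2
    grad = [0.0] * dim
    for _ in range(pairs):
        # Sample one Gaussian perturbation direction.
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Evaluate the mirrored pair theta + sigma*eps and theta - sigma*eps.
        plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
        minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
        # Weight the direction by the score difference of the pair.
        for i in range(dim):
            grad[i] += (plus - minus) * eps[i]
    return [g / (2.0 * sigma * pairs) for g in grad]


def es_optimize(fitness, theta, steps=200, lr=0.05, **kwargs):
    """Gradient ascent on fitness using the score-based estimate above."""
    rng = random.Random(0)
    for _ in range(steps):
        grad = es_gradient_estimate(fitness, theta, rng=rng, **kwargs)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def toy_fitness(params):
    # Hypothetical stand-in for an episodic score, maximal at (1, -2).
    return -((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2)


if __name__ == "__main__":
    best = es_optimize(toy_fitness, [0.0, 0.0])
    print(best, toy_fitness(best))
```

The optimizer only ever calls `fitness` and never differentiates it, which is the property the abstract describes. On a deceptive landscape the same estimate would follow the local score slope toward a poor optimum, which is the weakness that motivates adding behavior information.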