Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
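As a rough illustration, the constrained formulation described above might be written as follows. The notation is an assumption introduced here and does not come from the text: $\Pi^n = \{\pi^1, \dots, \pi^n\}$ is the set of policies, $d_\pi$ is the state occupancy of policy $\pi$, $r_e$ is the extrinsic reward (assumed to depend on the state only), $v_e^*$ is the optimal extrinsic value, and $\alpha \in [0, 1]$ stands for a hypothetical hyperparameter setting how close to optimal each policy must remain. The diversity measure is left abstract, since the text specifies only that it is a distance between state occupancies.

```latex
% Sketch under the assumptions stated above, not a definitive statement of the method.
\begin{equation*}
  \max_{\Pi^n} \; \mathrm{Diversity}\!\left(d_{\pi^1}, \dots, d_{\pi^n}\right)
  \quad \text{s.t.} \quad
  d_{\pi} \cdot r_e \;\ge\; \alpha \, v_e^{*}
  \quad \forall \, \pi \in \Pi^n .
\end{equation*}
```

In this reading, setting $\alpha$ close to 1 keeps every policy near the optimal extrinsic return and leaves little room for diversity, while a smaller $\alpha$ trades some return for more diverse behaviors.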