Reinforcement learning has drawn huge interest as a tool for solving optimal control problems. Solving a given problem (a task or environment) involves converging towards an optimal policy. However, multiple optimal policies may exist, and they can differ dramatically in their behaviour; for example, some may be faster than others but at the expense of greater risk. We consider and study a distribution of optimal policies. We design a curiosity-augmented Metropolis algorithm (CAMEO) that samples optimal policies and encourages these policies to adopt diverse behaviours, since diversity implies greater coverage of the different possible optimal policies. In simulation experiments, we show that the policies CAMEO obtains all solve classic control problems, even in the challenging case of environments that provide sparse rewards. We further show that the different policies we sample present different risk profiles. This has interesting practical applications in interpretability and represents a first step towards learning the distribution of optimal policies itself.
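To make the idea of sampling from a distribution of optimal policies concrete, here is a minimal sketch of a plain random-walk Metropolis sampler. It is not CAMEO: the abstract does not specify the form of the curiosity term, so none is included. Everything in it is an assumption made for illustration: the one-dimensional "policy parameter" `theta`, the hypothetical `toy_return` function with two equally good optima, the target density proportional to `exp(return / temperature)`, and the `temperature` and `step_size` values.

```python
import math
import random


def toy_return(theta):
    """Hypothetical return with two equally good optima, at theta = -1 and +1."""
    return max(-(theta + 1.0) ** 2, -(theta - 1.0) ** 2)


def metropolis(log_target, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    theta = theta0
    log_p = log_target(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            theta, log_p = proposal, log_p_new
        # The current state is recorded whether or not the move was accepted.
        samples.append(theta)
    return samples


def sample_policies(n_steps=20000, temperature=0.1, step_size=1.0, seed=0):
    """Sample toy policy parameters from a density proportional to exp(return / temperature)."""
    rng = random.Random(seed)

    def log_target(theta):
        return toy_return(theta) / temperature

    return metropolis(log_target, 0.0, n_steps, step_size, rng)


if __name__ == "__main__":
    samples = sample_policies()
    left = sum(1 for s in samples if s < 0.0) / len(samples)
    print(f"fraction near theta = -1: {left:.2f}")
    print(f"fraction near theta = +1: {1.0 - left:.2f}")
```

In this toy setting the chain visits both optima, so the samples cover two distinct high-return "policies" rather than converging to one. A low temperature concentrates the samples near the optima, while the proposal step size controls how easily the chain jumps between them.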