While reinforcement learning (RL) holds great potential for decision making in the real world, it suffers from a number of unique difficulties that often require specific consideration. In particular: it is highly non-stationary; suffers from high degrees of plasticity loss; and requires exploration to prevent premature convergence to local optima and to maximize return. In this paper, we consider whether learned optimization can help overcome these problems. Our method, Learned Optimization for Plasticity, Exploration and Non-stationarity (OPEN), meta-learns an update rule whose input features and output structure are informed by previously proposed solutions to these difficulties. We show that our parameterization is flexible enough to enable meta-learning in diverse learning contexts, including the ability to use stochasticity for exploration. Our experiments demonstrate that, when meta-trained on single and small sets of environments, OPEN matches or outperforms traditionally used optimizers. Furthermore, OPEN shows strong generalization across a distribution of environments and a range of agent architectures.
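To make the idea of meta-learning an update rule concrete, below is a minimal JAX sketch of one way such a rule could be parameterized: a small network maps per-parameter input features to a deterministic update and a learned noise scale, so the rule can inject stochasticity for exploration. This is an illustrative assumption rather than OPEN's actual feature set or architecture; the feature choice (gradient, momentum, training progress) and all names (`init_opt_net`, `learned_update`) are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of a learned
# per-parameter update rule. A tiny MLP maps hand-chosen input
# features to two outputs: a deterministic update and a log noise
# scale used to add Gaussian noise for exploration.

import jax
import jax.numpy as jnp

def init_opt_net(key, n_features=3, hidden=16):
    """Initialize a small MLP acting as the learned optimizer."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (n_features, hidden)) * 0.1,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, 2)) * 0.1,  # -> [update, log_noise]
        "b2": jnp.zeros(2),
    }

def learned_update(opt_params, grad, momentum, progress, key):
    """Compute a per-parameter update from input features.

    Features here (assumed, not OPEN's): the gradient, a momentum
    term, and normalized training progress as a crude proxy for
    non-stationarity. The second network output sets the scale of
    additive noise, letting the rule use stochasticity to explore.
    """
    feats = jnp.stack(
        [grad, momentum, jnp.full_like(grad, progress)], axis=-1
    )
    h = jnp.tanh(feats @ opt_params["w1"] + opt_params["b1"])
    out = h @ opt_params["w2"] + opt_params["b2"]
    update, log_noise = out[..., 0], out[..., 1]
    noise = jax.random.normal(key, grad.shape) * jnp.exp(log_noise)
    return update + noise  # applied as: params = params - lr * update
```

In the full method, the weights of such a network would themselves be trained in an outer loop (e.g., with evolution strategies or meta-gradients) to maximize agent return; the sketch only shows the interface of the inner update rule.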