Recent advancements in meta-learning have enabled the automatic discovery of novel reinforcement learning algorithms parameterized by surrogate objective functions. To improve upon manually designed algorithms, the parameterization of this learned objective function must be expressive enough to represent novel principles of learning (instead of merely recovering already established ones) while still generalizing to a wide range of settings outside of its meta-training distribution. However, existing methods focus on discovering objective functions that, like many widely used objective functions in reinforcement learning, do not take into account the total number of steps allowed for training, or "training horizon". In contrast, humans use a plethora of different learning objectives across the course of acquiring a new ability. For instance, students may alter their studying techniques based on the proximity to exam deadlines and their self-assessed capabilities. This paper contends that ignoring the optimization time horizon significantly restricts the expressive potential of discovered learning algorithms. We propose a simple augmentation to two existing objective discovery approaches that allows the discovered algorithm to dynamically update its objective function throughout the agent's training procedure, resulting in expressive schedules and increased generalization across different training horizons. In the process, we find that commonly used meta-gradient approaches fail to discover such adaptive objective functions while evolution strategies discover highly dynamic learning rules. We demonstrate the effectiveness of our approach on a wide range of tasks and analyze the resulting learned algorithms, which we find effectively balance exploration and exploitation by modifying the structure of their learning rules throughout the agent's lifetime.
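To make the idea of a horizon-aware objective concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a PPO-style clipped surrogate whose clip range is interpolated according to the fraction of the training budget already used. The names `clipped_surrogate`, `horizon_aware_objective`, and the two-element `theta` are hypothetical. `theta` stands in for the meta-learned parameters that an outer loop (for example, evolution strategies) would tune.

```python
def clipped_surrogate(ratio, advantage, epsilon):
    """PPO-style clipped surrogate for one sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def horizon_aware_objective(ratio, advantage, step, total_steps, theta):
    """Hypothetical objective conditioned on training progress.

    theta = (eps_start, eps_end) is an illustrative stand-in for
    meta-learned parameters: the clip range at the start and at the
    end of training. It is an assumption made for this sketch.
    """
    # Fraction of the training horizon already consumed, in [0, 1].
    progress = step / total_steps
    eps_start, eps_end = theta
    # Linearly interpolate the clip range over the agent's lifetime.
    epsilon = (1.0 - progress) * eps_start + progress * eps_end
    return clipped_surrogate(ratio, advantage, epsilon)


if __name__ == "__main__":
    theta = (0.4, 0.1)  # assumed values, chosen only for illustration
    for step in (0, 50, 100):
        value = horizon_aware_objective(1.5, 1.0, step, 100, theta)
        print(step, value)
```

Because the objective depends on `step / total_steps` rather than on the raw step count, the same `theta` yields the same schedule shape for a short or a long training run. This is one simple way to obtain an objective that changes over the agent's lifetime while remaining usable across different horizons.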