A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability since test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL is known to lead to both *biased gradients* and *data inefficiency*. We prove that the gradient bias disappears in our proposed MRL framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.
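The abstract does not specify the exact form of the robust objective or how RoML identifies hard tasks, so the following is only a minimal sketch under stated assumptions. It assumes the robustness level is a fraction `alpha` of the lowest-return tasks, scores a policy by the mean return over that tail, and over-samples those tasks by mixing a uniform task distribution with one concentrated on the tail. The names `tail_mean`, `hard_task_weights`, and `mix` are hypothetical and are not taken from the paper.

```python
import numpy as np


def tail_mean(returns, alpha):
    """Mean of the lowest ceil(alpha * n) returns.

    Assumption: the 'controlled robustness level' is modeled as the
    fraction alpha of worst-performing tasks; alpha = 1 recovers the
    ordinary average return.
    """
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return sorted_returns[:k].mean()


def hard_task_weights(est_returns, alpha, mix=0.5):
    """Task-sampling weights that over-sample low-return tasks.

    Assumption: 'harder' tasks are those whose estimated return lies in
    the lowest alpha fraction. The result blends a uniform distribution
    (weight 1 - mix) with one concentrated on those tasks (weight mix).
    """
    est_returns = np.asarray(est_returns, dtype=float)
    n = len(est_returns)
    k = max(1, int(np.ceil(alpha * n)))
    hard = np.argsort(est_returns)[:k]  # indices of the k lowest returns
    weights = np.full(n, (1.0 - mix) / n)
    weights[hard] += mix / k
    return weights


# Hypothetical per-task return estimates for four tasks.
est = [10.0, 2.0, 8.0, 4.0]
robust_score = tail_mean(est, alpha=0.5)
weights = hard_task_weights(est, alpha=0.5)
rng = np.random.default_rng(0)
next_task = rng.choice(len(est), p=weights)  # task index to train on next
```

In a training loop, the base MRL algorithm would be left unchanged and only the task-sampling step would draw from `weights`, which matches the abstract's description of RoML as a wrapper around any given MRL algorithm. How the return estimates are obtained and updated during training is not covered by this sketch.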