A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we propose a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL often leads to both biased gradients and data inefficiency. We prove that the former disappears in MRL, and address the latter via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML learns substantially different meta-policies and achieves robust returns on several navigation and continuous control benchmarks.
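To make the contrast between an average-return objective and a robust one concrete, the following is a minimal sketch rather than the RoML algorithm itself. It assumes the robust objective is the conditional value at risk (CVaR) over per-task returns, with `alpha` playing the role of the robustness level; the abstract does not name the risk measure, so this choice is an assumption. The function names, the toy returns, and the hard top-`alpha` selection rule are all hypothetical, chosen only to illustrate what "over-sampling harder tasks" could look like.

```python
import numpy as np


def mean_return(returns):
    """Standard MRL objective: average return over tasks."""
    return float(np.mean(returns))


def cvar_return(returns, alpha):
    """Assumed robust objective: mean of the worst ceil(alpha * n) task returns.

    alpha = 1.0 recovers the plain mean; smaller alpha focuses on harder tasks.
    """
    r = np.sort(np.asarray(returns, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(r))))
    return float(r[:k].mean())


def oversample_hard_tasks(task_ids, returns, alpha, batch_size, rng):
    """Hypothetical sampler: draw a batch only from the alpha-fraction of
    tasks with the lowest observed returns (a crude stand-in for a learned
    task-sampling distribution)."""
    order = np.argsort(returns)  # indices of tasks, worst first
    k = max(1, int(np.ceil(alpha * len(order))))
    hard = np.asarray(task_ids)[order[:k]]
    return rng.choice(hard, size=batch_size, replace=True)


if __name__ == "__main__":
    # Toy per-task returns: three easy tasks and two hard ones.
    task_ids = ["a", "b", "c", "d", "e"]
    returns = [10.0, 9.0, 8.0, 1.0, 0.0]

    print("mean objective:", mean_return(returns))
    print("CVaR objective (alpha=0.4):", cvar_return(returns, 0.4))

    rng = np.random.default_rng(0)
    print("over-sampled batch:", oversample_hard_tasks(task_ids, returns, 0.4, 8, rng))
```

In this toy setting the mean is dominated by the three easy tasks, while the CVaR-style objective depends only on the two hard ones, so a meta-policy tuned for one can differ substantially from a meta-policy tuned for the other. Sampling only tail tasks also hints at the data-efficiency concern: with uniform task sampling, most collected episodes would not contribute to the tail estimate.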