Meta-Reinforcement Learning (MRL) is a promising framework for training agents that can quickly adapt to new environments and tasks. In this work, we study the MRL problem under the policy gradient formulation and propose a novel algorithm that uses Moreau envelope surrogate regularizers to jointly learn a meta-policy that can be adapted to the environment of each individual task. Our algorithm, called Moreau Envelope Meta-Reinforcement Learning (MEMRL), learns a meta-policy that adapts to a distribution of tasks by efficiently updating the policy parameters with a combination of gradient-based optimization and Moreau envelope regularization. Moreau envelopes provide a smooth approximation of the policy optimization problem, which allows us to apply standard optimization techniques and converge to a stationary point. We provide a detailed analysis of MEMRL and show a sublinear convergence rate to a first-order stationary point for non-convex policy gradient optimization. Finally, we demonstrate the effectiveness of MEMRL on a multi-task 2D-navigation problem.
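To illustrate the smoothing property mentioned above, here is a minimal sketch on a toy scalar problem. It is not the MEMRL algorithm or its policy-gradient objective. It assumes the standard definition of the Moreau envelope, M_lam f(x) = min_y { f(y) + (y - x)^2 / (2 lam) }, and uses f(y) = |y| as a stand-in for a nonsmooth objective. All function names are hypothetical. For this choice of f, the minimizer is given by soft-thresholding and the envelope coincides with the Huber function, which is differentiable everywhere, including at the kink of |y|.

```python
def prox_abs(x, lam):
    """Proximal operator of f(y) = |y|: soft-thresholding with threshold lam."""
    shrunk = max(abs(x) - lam, 0.0)
    return shrunk if x > 0 else -shrunk


def moreau_envelope_abs(x, lam):
    """Moreau envelope of |.| at x, evaluated at the minimizer y = prox_abs(x, lam)."""
    y = prox_abs(x, lam)
    return abs(y) + (y - x) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope, using the identity grad M(x) = (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


def huber(x, lam):
    """Closed form that the envelope of |.| is known to equal (Huber function)."""
    if abs(x) <= lam:
        return x * x / (2.0 * lam)
    return abs(x) - lam / 2.0


# Toy settings (assumed for illustration only).
lam = 0.5
xs = [-2.0, -1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0, 2.0]
envelope_values = [moreau_envelope_abs(x, lam) for x in xs]
gradient_values = [moreau_grad_abs(x, lam) for x in xs]
```

The gradient here is x / lam clipped to [-1, 1], so it varies continuously through x = 0, whereas the derivative of |x| jumps from -1 to 1 there. A smaller lam tracks the original function more closely at the cost of a larger gradient Lipschitz constant (1 / lam in this example).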