Policy-based algorithms are among the most widely adopted techniques in model-free RL, thanks to their strong theoretical grounding and good properties in continuous action spaces. Unfortunately, these methods require precise, problem-specific hyperparameter tuning to achieve good performance, and they tend to struggle when asked to accomplish a series of heterogeneous tasks. In particular, the choice of step size has a crucial impact on their ability to learn a high-performing policy: it affects both the speed and the stability of training, and it is often the main culprit behind poor results. In this paper, we tackle these issues with a Meta Reinforcement Learning approach, introducing a new formulation, the meta-MDP, that can be used to solve any hyperparameter selection problem in RL with contextual processes. After providing a theoretical Lipschitz bound on the difference in performance across tasks, we adopt the proposed framework to train a batch RL algorithm that dynamically recommends the most adequate step size for different policies and tasks. Finally, we present an experimental campaign showing the advantages of selecting an adaptive learning rate in heterogeneous environments.
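To make the step-size sensitivity concrete, the following is a minimal illustrative sketch, not the method proposed in the paper. It assumes a toy concave objective J(theta) = -0.5 * c * theta^2, where the curvature c stands in for a "task"; the function names `run_gradient_ascent` and `final_gap` are hypothetical and introduced only for this example. Plain gradient ascent with a fixed step size is run on two such tasks.

```python
def run_gradient_ascent(curvature, step_size, theta0=1.0, n_steps=50):
    """Gradient ascent on the toy objective J(theta) = -0.5 * curvature * theta**2.

    The maximizer is theta = 0. 'curvature' plays the role of the task
    (an assumption of this sketch, not a construct from the paper).
    """
    theta = theta0
    for _ in range(n_steps):
        grad = -curvature * theta          # dJ/dtheta
        theta = theta + step_size * grad   # fixed-step ascent update
    return theta


def final_gap(curvature, step_size):
    """Distance from the optimum (theta = 0) after the fixed number of updates."""
    return abs(run_gradient_ascent(curvature, step_size))


if __name__ == "__main__":
    for curvature in (1.0, 10.0):
        for step_size in (0.01, 0.15, 0.5):
            gap = final_gap(curvature, step_size)
            print(f"curvature={curvature:5.1f}  step_size={step_size:4.2f}  gap={gap:.3e}")
```

In this toy setting each update multiplies theta by (1 - step_size * curvature), so a step size of 0.5 converges quickly when the curvature is 1 but diverges when the curvature is 10, while a step size of 0.01 is stable on both tasks yet slow on the first. No single fixed value suits both tasks, which is the kind of situation an adaptive, task-dependent step size is meant to address.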