Deep Reinforcement Learning (RL) is well known for being highly sensitive to hyperparameters, requiring substantial effort from practitioners to optimize them for the problem at hand. This also limits the applicability of RL in real-world scenarios. In recent years, the field of automated Reinforcement Learning (AutoRL) has gained popularity by addressing this issue. However, these approaches typically hinge on additional samples to select well-performing hyperparameters, hindering sample efficiency and practicality. Furthermore, most AutoRL methods build heavily on existing AutoML methods, which were originally developed without regard for the additional challenges that RL poses due to its non-stationarity. In this work, we propose a new AutoRL approach, called Adaptive $Q$-Network (AdaQN), that is tailored to RL and accounts for the non-stationarity of the optimization procedure without requiring additional samples. AdaQN learns several $Q$-functions, each trained with different hyperparameters, and updates them online using the $Q$-function with the smallest approximation error as a shared target. Our selection scheme handles several hyperparameters simultaneously, copes with the non-stationarity induced by the RL optimization procedure, and is orthogonal to any critic-based RL algorithm. We demonstrate that AdaQN is theoretically sound and empirically validate it on MuJoCo control problems and Atari $2600$ games, showing benefits in sample efficiency, overall performance, robustness to stochasticity, and training stability.
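To make the shared-target selection concrete, below is a minimal sketch, in PyTorch, of an ensemble of $Q$-networks trained with different hyperparameters, where the member with the smallest empirical Bellman error becomes the shared bootstrap target. The network architecture, the choice of learning rates as the varied hyperparameter, and the one-batch error estimate are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch of AdaQN's shared-target mechanism (illustrative assumptions:
# architecture, learning rates as the varied hyperparameter, and a
# one-batch Bellman-error estimate as the selection criterion).
import torch
import torch.nn as nn

def make_q_net(obs_dim=4, n_actions=2, hidden=64):
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

# Ensemble: each member is trained with its own hyperparameters.
learning_rates = [1e-3, 3e-4, 1e-4]
q_nets = [make_q_net() for _ in learning_rates]
optims = [torch.optim.Adam(q.parameters(), lr=lr)
          for q, lr in zip(q_nets, learning_rates)]
target_net = make_q_net()  # shared target used by every member
gamma = 0.99

def bellman_error(q_net, batch):
    # batch: (obs [B,4] float, act [B] long, rew [B] float,
    #         next_obs [B,4] float, done [B] float)
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        target = rew + gamma * (1 - done) * target_net(next_obs).max(dim=1).values
    q = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q, target)

def update_step(batch):
    # Every member regresses toward the same shared target network.
    for q_net, opt in zip(q_nets, optims):
        loss = bellman_error(q_net, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def refresh_target(batch):
    # Copy the member with the smallest approximation error into the
    # shared target; no extra environment samples are consumed.
    with torch.no_grad():
        errors = [bellman_error(q, batch).item() for q in q_nets]
    best = min(range(len(q_nets)), key=lambda i: errors[i])
    target_net.load_state_dict(q_nets[best].state_dict())

# Example usage with a random batch of transitions:
B = 32
batch = (torch.randn(B, 4), torch.randint(0, 2, (B,)), torch.randn(B),
         torch.randn(B, 4), torch.zeros(B))
update_step(batch)
refresh_target(batch)
```

The shared target decouples the hyperparameters used for training each member from the $Q$-function that supplies the bootstrap, which is how the selection can proceed online from data already in the replay buffer rather than from additional environment samples.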