In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning, outperforming classical hand-designed optimizers. Reinforcement learning (RL) differs fundamentally from supervised learning, and in practice these learned optimizers do not work well even on simple RL tasks. We investigate this phenomenon and identify three issues. First, the gradients of an RL agent span a wide range on a logarithmic scale while their absolute values lie in a small range, making it hard for a neural network to produce accurate parameter updates. Second, the agent-gradient distribution is not independent and identically distributed (non-i.i.d.), leading to inefficient meta-training. Finally, due to highly stochastic agent-environment interactions, the agent-gradients have high bias and variance, which increases the difficulty of learning an optimizer for RL. We propose gradient processing, pipeline training, and a novel optimizer structure with good inductive bias to address these issues. By applying these techniques, we show for the first time that learning an optimizer for RL from scratch is possible. Although trained only on toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
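To make the first issue concrete, the following is a minimal sketch of one plausible form of gradient processing: each raw gradient value is mapped to a sign and a log-magnitude, so that values differing by orders of magnitude become well separated as network inputs. The function name `process_gradient`, the constant `eps`, and the feature layout are assumptions made for illustration. The abstract does not specify the actual method.

```python
import math


def process_gradient(g, eps=1e-18):
    """Map a raw gradient value to (sign, log-magnitude) features.

    Hypothetical illustration: `eps` and the two-feature layout are
    assumptions, not details taken from the paper.
    """
    sign = float((g > 0) - (g < 0))   # -1.0, 0.0, or 1.0
    log_mag = math.log(abs(g) + eps)  # eps keeps log finite when g == 0
    return sign, log_mag


# Raw values all lie within 0.01 of zero, yet span six orders of magnitude.
grads = [1e-2, -1e-5, 1e-8]
features = [process_gradient(g) for g in grads]

for g, (s, m) in zip(grads, features):
    print(f"g={g:+.0e}  sign={s:+.0f}  log|g|={m:.2f}")
```

In raw form the three gradients are nearly indistinguishable to a network, since they differ by at most about 0.01. After the transformation, their log-magnitudes are spread across more than 13 units.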