Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much of RL theory has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty (analogous to annealing schemes and curricula during training in RL) and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and the Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.