Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much of the theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty, analogous to annealing schemes and curricula during training in RL, and show that the model exhibits rich behaviour, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and the Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.