Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much of RL theory has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty (analogous to annealing schemes and curricula during training in RL) and show that the model exhibits rich behaviour, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and the Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.