Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to reinforcement learning, proving that the concentration of a generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that deep reinforcement learning with SGD should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over training manifest as "opposing staircases" where regret decreases sharply while the LLC increases.