Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to deep reinforcement learning, proving that the concentration of the generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that Bayesian phase transitions in reinforcement learning should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over SGD training manifest as "opposing staircases" where regret decreases sharply while the LLC increases. Notably, the LLC detects phase transitions even when estimated on a subset of states where the policies appear identical in terms of regret, suggesting it captures changes in the underlying algorithm rather than just performance.
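For readers unfamiliar with the free-energy asymptotics this claim rests on, a minimal sketch, assuming a generalized posterior of Gibbs form with the empirical regret as loss (the symbols $\beta$, $G_n$, $\varphi$, and $w^*$ below are illustrative notation, not the paper's):

\[
p(w \mid D_n) \;\propto\; \exp\bigl(-n\beta\, G_n(w)\bigr)\,\varphi(w),
\qquad
F_n \;=\; -\log \int_W \exp\bigl(-n\beta\, G_n(w)\bigr)\,\varphi(w)\,dw.
\]

In the singular setting, Watanabe's expansion gives, locally around a parameter $w^*$,

\[
F_n \;=\; n\beta\, G_n(w^*) \;+\; \lambda(w^*)\log n \;+\; O(\log\log n),
\]

where $\lambda(w^*)$ is the local learning coefficient. The accuracy term grows linearly in $n$ while the complexity term grows only logarithmically, so as sample size increases the posterior's preference shifts from low-$\lambda$, high-regret policies to higher-$\lambda$, low-regret policies, producing the predicted phase transitions.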