Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor--critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyze this phenomenon through the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between the LR and activation sign changes, clarifying how a correct evolution of the neurons' inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10\% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving the highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving the highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early-return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall levels, while combining early return with OUI yields the highest precision in the best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training.
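To make the batch-based formulation concrete, the following is a minimal sketch of how a per-layer OUI could be computed over a fixed probe batch. The abstract does not give the exact formula, so the specific balance score \(4p(1-p)\) (which is 1 when a neuron is active on exactly half the probe inputs and 0 when it is saturated) and the function name \texttt{oui} are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def oui(preacts: np.ndarray) -> float:
    """Hypothetical batch-based OUI sketch (not the paper's exact formula).

    preacts: array of shape (batch, neurons) holding one hidden layer's
    pre-activation values over a fixed probe batch.
    """
    # Binary activation pattern: is each neuron active (ReLU-style sign test)?
    active = preacts > 0.0                  # shape (batch, neurons)
    # Per-neuron fraction of probe inputs on which the neuron is active.
    p = active.mean(axis=0)                 # shape (neurons,)
    # Assumed balance score: 1 at p = 0.5, 0 when saturated (p = 0 or p = 1).
    balance = 4.0 * p * (1.0 - p)
    # Aggregate over neurons into a single layer-level indicator.
    return float(balance.mean())
```

Under this sketch, values near 0 would indicate saturated (frozen) activation patterns and values near 1 maximally balanced ones, matching the abstract's reading of an intermediate band for critics and high values for actors.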