We offer a new perspective on why reinforcement learning (RL) struggles with robustness and generalization. We show, by example, that locally optimal policies may contain unstable control for some dynamic parameters, and that overfitting to such instabilities can degrade robustness and generalization. Contraction analysis of neural control reveals that there exist boundaries between stable and unstable control with respect to the input gradients of control networks. Ignoring those stability boundaries, learning agents may label actions that cause instabilities for some dynamic parameters as high-value actions if those actions improve the expected return. Because such instabilities make up only a small fraction of cases, they may not attract attention in empirical studies, which makes them a hidden risk for real-world applications. These instabilities can manifest themselves through overfitting, leading to failures in robustness and generalization. We propose stability constraints and terminal constraints to address this issue and demonstrate them with a proximal policy optimization example.
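As a minimal illustration of a stability boundary with respect to the input gradient of a controller, consider a toy setting that is an assumption of this sketch and not the paper's contraction analysis or experiments: a scalar linear system x_{t+1} = a*x_t + b*u_t with a linear policy u_t = -k*x_t. Here the gain k is (up to sign) the input gradient of the policy, the closed loop contracts only when |a - b*k| < 1, and a single fixed gain can therefore be stable for one dynamic parameter a and unstable for another. All function names below are hypothetical.

```python
def closed_loop_multiplier(a, b, k):
    """Per-step multiplier of x_{t+1} = a*x_t + b*u_t under u_t = -k*x_t."""
    return a - b * k


def is_contracting(a, b, k):
    """The toy closed loop contracts iff the multiplier has magnitude below 1."""
    return abs(closed_loop_multiplier(a, b, k)) < 1.0


def stable_gain_interval(a, b):
    """Open interval of gains k satisfying |a - b*k| < 1 (toy assumption: b != 0)."""
    if b == 0:
        raise ValueError("b must be nonzero for the gain to affect stability")
    lo, hi = sorted(((a - 1.0) / b, (a + 1.0) / b))
    return lo, hi


def rollout(a, b, k, x0=1.0, steps=20):
    """Simulate the closed loop for a fixed number of steps and return the final state."""
    x = x0
    for _ in range(steps):
        u = -k * x          # linear policy; its input gradient is -k
        x = a * x + b * u   # toy dynamics with dynamic parameter a
    return x


if __name__ == "__main__":
    k = 1.8  # one fixed, fairly aggressive gain
    for a in (1.0, 0.5):  # two values of the dynamic parameter
        print(
            f"a={a}: stable gains {stable_gain_interval(a, 1.0)}, "
            f"contracting={is_contracting(a, 1.0, k)}, "
            f"|x_T|={abs(rollout(a, 1.0, k)):.4g}"
        )
```

In this sketch the same gain lies inside the stable interval for one value of a and outside it for the other, which mirrors, in a much simpler form, how a policy that looks acceptable on average can still be unstable for a small fraction of dynamic parameters.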