Policy gradient methods have enabled deep reinforcement learning (RL) to tackle challenging continuous control problems, even when the underlying systems involve highly nonlinear dynamics that generate complex, non-smooth optimization landscapes. We develop a rigorous framework for understanding how policy gradient methods mollify non-smooth optimization landscapes to enable effective policy search, and we characterize the accompanying downside: while mollification makes the objective function smoother and easier to optimize, the stochastic objective deviates further from the original problem. We demonstrate the equivalence between policy gradient methods and solving backward heat equations. Because backward heat equations are ill-posed in PDE theory, this equivalence exposes a fundamental challenge to the use of policy gradient under stochasticity. Moreover, we connect this limitation to the uncertainty principle in harmonic analysis to understand the effects of exploration with stochastic policies in RL. We also provide experimental results that illustrate both the positive and negative aspects of mollification effects in practice.
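
The mollification trade-off can be illustrated on a one-dimensional toy problem. The sketch below is not the framework developed in the paper: the objective `objective`, the helper names, and the sample sizes are all hypothetical choices made for illustration. It assumes a Gaussian perturbation of a scalar parameter, so that the smoothed objective is E[f(θ + σZ)] with Z ~ N(0, 1), which is the solution at time t = σ²/2 of the forward heat equation u_t = u_θθ started from f. Recovering f from the smoothed objective would correspond to running that equation backward.

```python
import numpy as np


def objective(theta):
    """Toy non-smooth objective (hypothetical): a kink at theta = 0."""
    return np.abs(theta)


def smoothed_objective(theta, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[f(theta + sigma * Z)] with Z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    return float(np.mean(objective(theta + sigma * z)))


def smoothed_gradient(theta, sigma, n_samples=200_000, seed=0):
    """Score-function estimate of d/dtheta E[f(theta + sigma * Z)].

    Subtracting the baseline f(theta) leaves the estimate unbiased
    (because E[Z] = 0) and reduces its variance.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    values = objective(theta + sigma * z)
    return float(np.mean((values - objective(theta)) * z) / sigma)


if __name__ == "__main__":
    for sigma in (0.1, 0.5, 1.0):
        value_at_kink = smoothed_objective(0.0, sigma)
        grad = smoothed_gradient(0.5, sigma)
        print(f"sigma={sigma}: smoothed f(0)={value_at_kink:.4f}, "
              f"gradient at 0.5={grad:.4f}")
```

For this particular objective the smoothed value at the kink is σ√(2/π) in closed form, so a larger σ yields a differentiable surrogate whose minimum value sits further above the true minimum of 0. That is a small instance of the tension described above: more smoothing, larger deviation from the original problem.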