Understanding the learning dynamics of policies is important for unveiling the mysteries of Reinforcement Learning (RL). It is especially crucial yet challenging for Deep RL (DRL), where such an understanding could yield remedies for notorious issues like sample inefficiency and learning instability. In this paper, we study how the policy networks of typical DRL agents evolve during the learning process by empirically investigating several kinds of temporal change for each policy parameter. On typical MuJoCo and DeepMind Control Suite (DMC) benchmarks, we find phenomena common to TD3 and RAD agents: 1) the activity of policy network parameters is highly asymmetric, and policy networks advance monotonically along very few major parameter directions; 2) severe detours occur in parameter updates, and harmonic-like changes are observed for all minor parameter directions. By performing a novel temporal SVD along the policy learning path, we identify the major and minor parameter directions as the columns of the right unitary matrix associated with the dominant and insignificant singular values, respectively. Driven by these discoveries, we propose a simple and effective method, called Policy Path Trimming and Boosting (PPTB), as a general plug-in improvement to DRL algorithms. The key idea of PPTB is to periodically trim the policy learning path by canceling the policy updates in minor parameter directions, while boosting the learning path by encouraging the advance in major directions. In experiments, we demonstrate the general and significant performance improvements brought by PPTB when combined with TD3 and RAD in MuJoCo and DMC environments, respectively.
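The trimming and boosting idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; it rests on several assumptions not stated in the abstract: policy snapshots are flattened into the rows of a matrix, the SVD is taken over displacements from the first snapshot, and boosting is modeled as scaling the coordinates along the major directions by a hypothetical factor `1 + boost`. The function name `trim_and_boost` and the parameters `n_major` and `boost` are illustrative only.

```python
import numpy as np


def trim_and_boost(path, n_major=2, boost=0.0):
    """Sketch of path trimming and boosting on a policy learning path.

    path:    (T, d) array, one flattened policy-parameter snapshot per row
             (assumed layout).
    n_major: number of leading right singular vectors treated as major
             directions (hypothetical parameter).
    boost:   extra fraction of advance added along major directions
             (hypothetical parameter; 0.0 means trimming only).
    Returns the adjusted parameter vector and the singular values.
    """
    start = path[0]
    deltas = path - start  # displacement of each snapshot from the first one

    # Temporal SVD: the rows of Vt are the right singular vectors,
    # i.e. candidate parameter directions ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(deltas, full_matrices=False)
    major = Vt[:n_major]  # shape (n_major, d)

    # Coordinates of the latest displacement along the major directions.
    coords = major @ deltas[-1]

    # Trimming: drop the components along minor directions.
    # Boosting: scale the retained components by (1 + boost).
    new_params = start + (1.0 + boost) * (coords @ major)
    return new_params, singular_values
```

In an agent, the returned vector would be reshaped and written back into the policy network at some interval; how often to do this and how to choose `n_major` and `boost` are left open here.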