Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods, and that the spread of the post-update return distribution $R(\theta)$, obtained by repeatedly sampling minibatches, updating $\theta$, and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow $R(\theta)$ can improve stability, directly estimating $R(\theta)$ is computationally expensive in high-dimensional settings. We propose an alternative that leverages environmental stochasticity to mitigate update-induced variability. Specifically, we model the state-action return distribution with a distributional critic and bias PPO's advantage function using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behavior, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow $R(\theta)$. In such cases, our moment-based correction narrows $R(\theta)$, improving stability by up to 75% on Walker2D while preserving comparable evaluation returns.
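As a rough illustration of the moment-based correction described above, the sketch below adjusts standard PPO advantages using skewness and kurtosis computed from a quantile-based distributional critic. The function name, the additive penalty form, and the coefficients `lambda_skew` and `lambda_kurt` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumptions noted above): penalize PPO advantages when the
# critic's return distribution for a state-action pair is skewed or heavy-tailed.
import numpy as np

def moment_adjusted_advantage(advantages, return_quantiles,
                              lambda_skew=0.1, lambda_kurt=0.05):
    """
    advantages:       (batch,) standard PPO/GAE advantage estimates
    return_quantiles: (batch, n_quantiles) quantile estimates of the
                      state-action return distribution from a distributional critic
    Returns advantages penalized for asymmetric or heavy-tailed return estimates.
    """
    mean = return_quantiles.mean(axis=1, keepdims=True)
    std = return_quantiles.std(axis=1, keepdims=True) + 1e-8
    z = (return_quantiles - mean) / std
    skew = (z ** 3).mean(axis=1)   # third standardized moment
    kurt = (z ** 4).mean(axis=1)   # fourth standardized moment
    # Penalize extreme tail behavior so the policy is steered away from
    # parameter regions prone to unstable post-update returns.
    penalty = lambda_skew * np.abs(skew) + lambda_kurt * np.clip(kurt - 3.0, 0.0, None)
    return advantages - penalty
```

The biased advantages would then replace the standard estimates in the usual clipped PPO surrogate objective, leaving the rest of the algorithm unchanged.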