Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) methods instead model the full value distribution, which has been shown to improve performance in many settings. In this paper, we model the value distribution as approximately normal using the Markov chain central limit theorem. We analytically compute quantile bars to provide a new DRL target that is informed by the decrease in standard deviation that occurs over the course of an episode. In addition, we propose a policy update strategy based on uncertainty, as measured by structural characteristics of the value distribution that are not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds and show that our methods produce performance improvements in continuous control tasks.
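As a rough illustration of what analytically computing quantiles of an approximately normal value distribution can look like, the sketch below evaluates the normal inverse CDF at evenly spaced quantile midpoints. The function names, the midpoint choice, and in particular the square-root schedule for shrinking the standard deviation with the remaining episode length are assumptions made for this example only; they are not taken from the paper and should not be read as its actual target.

```python
from math import sqrt
from statistics import NormalDist


def normal_quantiles(mean, std, num_quantiles):
    """Quantile values of N(mean, std^2) at midpoints tau_i = (2i + 1) / (2N).

    Hypothetical helper: the midpoint spacing is an assumption for illustration.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    if num_quantiles < 1:
        raise ValueError("num_quantiles must be at least 1")
    dist = NormalDist(mu=mean, sigma=std)
    taus = [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]
    # q_tau = mean + std * Phi^{-1}(tau), computed here via inv_cdf
    return [dist.inv_cdf(tau) for tau in taus]


def remaining_std(initial_std, step, horizon):
    """Assumed schedule: std shrinks with the square root of the remaining steps.

    This is only one plausible way to encode "less return left to accumulate
    means less spread"; the paper's own expression may differ.
    """
    if not 0 <= step < horizon:
        raise ValueError("step must satisfy 0 <= step < horizon")
    return initial_std * sqrt((horizon - step) / horizon)


if __name__ == "__main__":
    # Quantile targets early and late in a 100-step episode (made-up numbers).
    for t in (0, 75):
        std_t = remaining_std(2.0, t, 100)
        print(t, normal_quantiles(mean=1.0, std=std_t, num_quantiles=5))
```

Under these assumptions, the quantile targets stay centered on the predicted mean and tighten around it as the episode progresses, which is the qualitative behavior the abstract describes.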