Distributional reinforcement learning algorithms have attempted to use estimated uncertainty for exploration, for example through optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may bias data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing the risk criterion, which avoids a one-sided tendency toward risk. We provide a perturbed distributional Bellman optimality operator by distorting the risk measure, and we prove the convergence and optimality of the proposed method under a weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments, including 55 Atari games.
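To make the idea of randomizing the risk criterion concrete, the following is a minimal sketch and not the paper's exact construction. It assumes each action's return distribution is represented by a list of quantile estimates, draws a random weighting over those quantiles from a symmetric Dirichlet distribution, and blends it with the uniform (risk-neutral) weighting using a hypothetical decay schedule `1 / (1 + decay * step)`. The names `sample_risk_weights`, `select_action`, `mix`, and `decay` are illustrative assumptions.

```python
import random


def sample_risk_weights(n_quantiles, mix, rng):
    """Blend uniform (risk-neutral) weights with a random point on the simplex.

    mix = 0 gives the plain mean; mix = 1 gives a fully random risk criterion.
    """
    # Normalized Gamma(1, 1) draws form a symmetric Dirichlet(1, ..., 1) sample.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_quantiles)]
    total = sum(draws)
    uniform = 1.0 / n_quantiles
    return [(1.0 - mix) * uniform + mix * d / total for d in draws]


def select_action(quantiles, step, rng, decay=1.0):
    """Pick the action with the highest randomly distorted expected return.

    quantiles[a][i] is the i-th quantile estimate of the return of action a.
    The decay schedule below is an assumption made for illustration only.
    """
    mix = 1.0 / (1.0 + decay * step)
    weights = sample_risk_weights(len(quantiles[0]), mix, rng)
    scores = [sum(w * q for w, q in zip(weights, row)) for row in quantiles]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Action 0 is risky with the higher mean; action 1 is safe with a lower mean.
    quantiles = [[-10.0, 0.0, 13.0], [0.5, 0.5, 0.5]]
    early = [select_action(quantiles, 0, rng) for _ in range(10)]
    late = [select_action(quantiles, 10**6, rng) for _ in range(10)]
    print("early choices:", early)
    print("late choices:", late)
```

Because the sampled weighting is sometimes risk-seeking and sometimes risk-averse, early action choices are not tied to one fixed attitude toward risk. As the perturbation shrinks, the score approaches the ordinary mean of the quantiles, so selection becomes greedy with respect to the expected return.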