Distributional reinforcement learning algorithms have attempted to use estimated uncertainty for exploration, for example through optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may bias data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing the risk criterion, which avoids a one-sided tendency toward risk. We define a perturbed distributional Bellman optimality operator by distorting the risk measure, and we prove the convergence and optimality of the proposed method under a weaker contraction property. Our theoretical results indicate that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we show empirically that our method outperforms other existing distribution-based algorithms in various environments, including 55 Atari games.
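To make the idea of action selection under a randomized risk criterion concrete, the following is a minimal sketch and not the paper's implementation. It assumes each action's return distribution is summarized by a list of equally weighted quantile estimates, and that the risk criterion is perturbed by mixing uniform quantile weights with a random point on the probability simplex. The names `sample_risk_weights`, `select_action`, and `perturb_scale` are hypothetical and chosen only for illustration.

```python
import random


def sample_risk_weights(num_quantiles, perturb_scale, rng):
    """Return quantile weights: uniform weights mixed with a random simplex point.

    Assumption: perturb_scale in [0, 1] controls how far the risk criterion
    is distorted away from the risk-neutral (uniform) weighting.
    """
    uniform = 1.0 / num_quantiles
    # Normalized Exp(1) draws give a uniform sample from the probability simplex.
    draws = [rng.expovariate(1.0) for _ in range(num_quantiles)]
    total = sum(draws)
    scale = min(max(perturb_scale, 0.0), 1.0)
    return [(1.0 - scale) * uniform + scale * d / total for d in draws]


def select_action(quantiles, perturb_scale, rng):
    """Pick the action with the highest weighted average of its quantiles.

    quantiles: list of per-action lists of estimated return quantiles,
    all of the same length.
    """
    weights = sample_risk_weights(len(quantiles[0]), perturb_scale, rng)
    scores = [sum(w * q for w, q in zip(weights, row)) for row in quantiles]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Two actions: a wide distribution and a narrow one (illustrative values).
    estimates = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
    action, weights = select_action(estimates, perturb_scale=0.5, rng=rng)
    print(action, weights)
```

In this sketch, setting `perturb_scale` to 0 reduces the rule to greedy selection on the mean of the quantiles, while larger values let the sampled weights emphasize either the lower or the upper quantiles on any given step, so the selection is not consistently optimistic or pessimistic. Whether and how the perturbation should be decayed over training is left open here.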