Distributional reinforcement learning algorithms have attempted to use estimated uncertainty for exploration, for example through optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may bias data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing the risk criterion, which avoids a one-sided tendency toward risk. We provide a perturbed distributional Bellman optimality operator obtained by distorting the risk measure, and we prove the convergence and optimality of the proposed method under a weaker contraction property. Our theoretical results indicate that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments, including 55 Atari games.
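As a rough illustration of action selection under a randomized risk criterion, the sketch below scores each action by a randomly weighted average of its estimated return quantiles and picks the best one. It is a minimal sketch under stated assumptions, not the algorithm from the paper: the abstract does not specify how the risk measure is distorted, so the Dirichlet-sampled weights, the function name `select_action`, and the `perturb` flag are all hypothetical choices made for illustration.

```python
import numpy as np


def select_action(quantiles, rng, perturb=True):
    """Pick an action from per-action return quantile estimates.

    quantiles: array of shape (n_actions, n_quantiles); row a holds the
               estimated quantile values of the return for action a.
    rng:       a numpy random Generator.
    perturb:   if True, use a randomly drawn risk criterion; if False,
               use uniform weights, i.e. the plain mean (risk-neutral).
    """
    n_actions, n_quantiles = quantiles.shape
    if perturb:
        # ASSUMPTION: the random risk criterion is a point drawn uniformly
        # from the probability simplex (symmetric Dirichlet). Weight on low
        # quantiles acts pessimistically, weight on high quantiles
        # optimistically, so no single risk attitude is favored.
        weights = rng.dirichlet(np.ones(n_quantiles))
    else:
        weights = np.full(n_quantiles, 1.0 / n_quantiles)
    # Distorted expectation of each action's return distribution.
    scores = quantiles @ weights
    return int(np.argmax(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Action 0 is risky (high upside), action 1 is safe (constant return).
    q = np.array([[0.0, 0.0, 10.0],
                  [3.0, 3.0, 3.0]])
    print([select_action(q, rng) for _ in range(10)])
```

Because the weights are redrawn on every call, the risky and the safe action are both selected over time, whereas a fixed optimistic criterion would always prefer the risky one. The sketch omits any schedule for shrinking the perturbation, which a convergence guarantee like the one stated above would presumably require.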