Reinforcement learning (RL) has proven highly effective for complex decision-making and control tasks. However, prevalent model-free RL methods often suffer severe performance degradation because of the well-known overestimation issue. To address this problem, we recently introduced an off-policy RL algorithm called distributional soft actor-critic (DSAC or DSAC-v1), which improves value estimation accuracy by learning a continuous Gaussian value distribution. Nonetheless, standard DSAC has its own shortcomings, including occasionally unstable learning and the need for task-specific reward scaling, which may hinder its overall performance and adaptability on certain tasks. This paper introduces three refinements to standard DSAC to address these shortcomings: critic gradient adjusting, twin value distribution learning, and variance-based target return clipping. The modified algorithm is named DSAC with three refinements (DSAC-T or DSAC-v2), and its performance is systematically evaluated on a diverse set of benchmark tasks. Without any task-specific hyperparameter tuning, DSAC-T surpasses many mainstream model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, unlike standard DSAC, DSAC-T ensures a highly stable learning process and delivers similar performance across varying reward scales.
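To make two of the refinements concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each critic outputs a Gaussian mean and standard deviation, that the twin critics are combined by keeping the distribution with the smaller mean (by analogy with clipped double-Q learning), and that a sampled target return is clipped to within `xi` standard deviations of the current mean. The function names, the default `xi = 3.0`, and the exact form of both rules are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def select_twin(mean1, std1, mean2, std2):
    """Per sample, keep the Gaussian whose mean is smaller (assumed rule)."""
    use_first = mean1 <= mean2
    mean = np.where(use_first, mean1, mean2)
    std = np.where(use_first, std1, std2)
    return mean, std


def clip_target_return(target, current_mean, current_std, xi=3.0):
    """Clip a target return to current_mean +/- xi * current_std (assumed form)."""
    bound = xi * current_std
    return np.clip(target, current_mean - bound, current_mean + bound)


# Hypothetical outputs of two critics for a batch of two state-action pairs.
mean, std = select_twin(
    np.array([1.0, 5.0]), np.array([0.5, 0.2]),
    np.array([2.0, 4.0]), np.array([0.1, 0.3]),
)

# The first target is far outside the 3-sigma band and is pulled back to its
# upper edge; the second already lies inside the band and is left unchanged.
clipped = clip_target_return(np.array([10.0, 4.2]), mean, std)
```

Because the clipping bound scales with the predicted standard deviation, it adapts to the magnitude of the returns rather than relying on a fixed, task-specific threshold, which is one plausible reading of how the method reduces sensitivity to reward scale.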