Reinforcement learning (RL) has proven highly effective for complex decision-making and control tasks. However, prevalent model-free RL methods often suffer severe performance degradation caused by the well-known overestimation issue. To address this problem, we recently introduced an off-policy RL algorithm called distributional soft actor-critic (DSAC or DSAC-v1), which improves value estimation accuracy by learning a continuous Gaussian value distribution. Nonetheless, standard DSAC has shortcomings of its own, including occasionally unstable learning and the need for task-specific reward scaling, which may hinder its overall performance and adaptability in some tasks. This paper introduces three refinements to standard DSAC to address these shortcomings: expected value substituting, twin value distribution learning, and variance-based critic gradient adjusting. The modified RL algorithm is named DSAC with three refinements (DSAC-T or DSAC-v2), and its performance is systematically evaluated on a diverse set of benchmark tasks. Without any task-specific hyperparameter tuning, DSAC-T surpasses or matches many mainstream model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, unlike standard DSAC, DSAC-T ensures a highly stable learning process and delivers similar performance across varying reward scales.
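To make the idea of a Gaussian value distribution more concrete, the following is a minimal sketch and not the DSAC or DSAC-T implementation. It assumes the critic outputs a mean and a standard deviation for the return and is fit by Gaussian negative log-likelihood; the function names `gaussian_nll` and `nll_grad_wrt_mean` are hypothetical and chosen for illustration only. Under that assumption, the gradient with respect to the mean scales with the inverse variance, which is one plausible reason a distributional critic can be sensitive to reward scale.

```python
import math


def gaussian_nll(target, mean, std):
    """Negative log-likelihood of a scalar target under N(mean, std^2).

    Illustrative assumption: the critic is trained by minimizing this loss.
    """
    var = std ** 2
    return 0.5 * math.log(2.0 * math.pi * var) + (target - mean) ** 2 / (2.0 * var)


def nll_grad_wrt_mean(target, mean, std):
    """Analytic derivative of gaussian_nll with respect to the mean."""
    return -(target - mean) / std ** 2


if __name__ == "__main__":
    # Scale the target and the standard deviation together, as a change of
    # reward scale would; the gradient magnitude shrinks as 1 / scale.
    for scale in (1.0, 10.0, 100.0):
        grad = nll_grad_wrt_mean(target=scale, mean=0.0, std=scale)
        print(f"scale={scale:6.1f}  grad_wrt_mean={grad:.4f}")
```

In this toy setting the same relative error yields a different gradient magnitude at each scale, so some form of variance-aware normalization of the critic gradient is a natural remedy; the specific adjustment used by DSAC-T is not reproduced here.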