We study the adaptation of Soft Actor-Critic (SAC), a state-of-the-art reinforcement learning (RL) algorithm, from continuous to discrete action spaces. We revisit vanilla discrete SAC and provide an in-depth analysis of its Q-value underestimation and performance-instability issues in discrete settings. We then propose Stable Discrete SAC (SDSAC), an algorithm that leverages an entropy penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on standard benchmarks with discrete action spaces, including Atari games and a large-scale MOBA game, demonstrate the efficacy of the proposed method. Our code is available at: https://github.com/coldsummerday/SD-SAC.git.