This paper presents AFU, an off-policy deep RL algorithm that addresses the challenging "max-Q problem" of Q-learning in continuous action spaces in a new way, with a solution based on regression and conditional gradient scaling. AFU has an actor, but its critic updates are entirely independent of it; as a consequence, the actor can be chosen freely. In the initial version, AFU-alpha, we employ the same stochastic actor as Soft Actor-Critic (SAC). We then study a simple failure mode of SAC and show how AFU can be modified to make its actor updates less likely to become trapped in local optima, resulting in a second version of the algorithm, AFU-beta. Experimental results demonstrate the sample efficiency of both versions of AFU, marking it as the first model-free off-policy algorithm that is competitive with state-of-the-art actor-critic methods while departing from the actor-critic perspective.
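The abstract only names the ingredients of the max-Q solution (regression plus conditional gradient scaling), so the following is a minimal sketch of what such a critic update could look like. It assumes a decomposition Q(s, a) = V(s) - A(s, a) with A(s, a) >= 0, so that V(s) upper-bounds the Q-values and can be regressed toward max_a Q(s, a); the networks `v_net` and `a_net`, the scaling factor `rho`, and the scaling rule itself are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch only: the abstract states that AFU solves the max-Q
# problem via "regression and conditional gradient scaling"; the specific
# decomposition and scaling rule below are assumptions for illustration.
import torch

def critic_loss(v_net, a_net, state, action, q_target, rho=0.2):
    """Hypothetical regression loss with conditional gradient scaling.

    Assumed decomposition: Q(s, a) = V(s) - A(s, a), with A(s, a) >= 0,
    so that V(s) >= Q(s, a) and regression can pull V(s) toward
    max_a Q(s, a) rather than toward the mean of the sampled targets.
    """
    v = v_net(state)                      # V(s)
    a = torch.relu(a_net(state, action))  # A(s, a) >= 0 (assumed constraint)

    # Conditional gradient scaling (assumed form): where V(s) already
    # exceeds the regression target, shrink the gradient flowing into V
    # by a factor rho, so the upper bound tightens instead of inflating.
    over = (v > q_target).float().detach()
    scale = 1.0 - over * (1.0 - rho)
    v_scaled = scale * v + (1.0 - scale) * v.detach()  # same value, scaled grad

    return ((v_scaled - a - q_target) ** 2).mean()
```

The `scale * v + (1 - scale) * v.detach()` trick leaves the forward value of V(s) unchanged while multiplying its gradient by `scale`, which is one standard way to implement a conditional, per-sample gradient scaling inside an otherwise ordinary squared-error regression.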