Off-policy actor-critic algorithms have shown promise in deep reinforcement learning for continuous control tasks. Their success largely stems from leveraging pessimistic state-action value function updates, which effectively address function approximation errors and improve performance. However, such pessimism can lead to under-exploration, constraining the agent's ability to explore and refine its policies. Conversely, optimism can counteract under-exploration, but it also carries the risk of excessive risk-taking and poor convergence if not properly balanced. Based on these insights, we introduce Utility Soft Actor-Critic (USAC), a novel framework within the actor-critic paradigm that enables independent control over the degree of pessimism or optimism for both the actor and the critic via interpretable parameters. USAC adapts its exploration strategy based on the uncertainty of the critics through a utility function that balances pessimism and optimism separately for each component. By going beyond binary choices of optimism and pessimism, USAC represents a significant step towards achieving balance within off-policy actor-critic algorithms. Our experiments across various continuous control problems show that the appropriate degree of pessimism or optimism depends on the nature of the task. Furthermore, we demonstrate that USAC can outperform state-of-the-art algorithms when its pessimism/optimism parameters are appropriately configured.
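To make the idea of separately tunable pessimism and optimism concrete, here is a minimal illustrative sketch (not the paper's actual utility function or implementation): given an ensemble of critic estimates for a state-action pair, a scalar parameter `beta` shifts the mean estimate by a multiple of the ensemble's standard deviation, with `beta < 0` giving a pessimistic estimate and `beta > 0` an optimistic one. The function name and the specific mean-plus-scaled-deviation form are assumptions for illustration only.

```python
# Hypothetical sketch: a lower-confidence-bound style estimate from a critic
# ensemble, with the sign and magnitude of `beta` controlling pessimism vs.
# optimism. The actor and critic could each use their own `beta`.
from statistics import mean, pstdev

def utility_value(q_estimates, beta):
    """Return mean(Q) + beta * std(Q) over an ensemble of critic estimates."""
    return mean(q_estimates) + beta * pstdev(q_estimates)

# Two critics disagree about the value of the same state-action pair.
q_ensemble = [1.0, 3.0]  # ensemble mean = 2.0, population std = 1.0

critic_target = utility_value(q_ensemble, beta=-0.5)  # pessimistic estimate: 1.5
actor_signal = utility_value(q_ensemble, beta=0.5)    # optimistic estimate: 2.5
```

Using distinct `beta` values for the critic's bootstrap target and the actor's policy-improvement signal is what "independent control" refers to here: the critic can remain conservative against approximation error while the actor is encouraged to explore.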