In this paper, we introduce a novel method for enhancing the effectiveness of on-policy Deep Reinforcement Learning (DRL) algorithms. Current on-policy algorithms, such as Proximal Policy Optimization (PPO) and Asynchronous Advantage Actor-Critic (A3C), do not sufficiently account for cautious interaction with the environment. Our method addresses this gap by explicitly integrating cautious interaction in two ways: by maximizing a lower bound on the true value function plus a constant, thereby promoting \textit{conservative value estimation}, and by incorporating Thompson sampling for cautious exploration. These features are realized through three surprisingly simple modifications to the A3C algorithm: processing advantage estimates through a ReLU function, spectral normalization, and dropout. We provide a theoretical proof that our algorithm maximizes the lower bound, which also grounds Regret Matching Policy Gradients (RMPG), a discrete-action on-policy method for multi-agent reinforcement learning. Our empirical evaluations across various benchmarks consistently demonstrate improved performance over existing on-policy algorithms. This research represents a substantial step towards more cautious and effective DRL algorithms, with the potential to unlock applications to complex, real-world problems.
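The three modifications named above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`relu_advantage_loss`, `spectral_norm`, `dropout`), the plain advantage estimate `return - value`, and the use of power iteration are all assumptions chosen for illustration. The link between dropout and Thompson sampling (keeping dropout active when acting, so each forward pass behaves like a sample from an approximate posterior) is likewise an interpretation of the abstract rather than a stated detail.

```python
import math
import random


def relu_advantage_loss(log_probs, returns, values):
    """Policy loss with advantages passed through a ReLU.

    Computes -mean(max(A, 0) * log pi(a|s)) with A = return - value
    (an assumed, simplified advantage estimate), so only actions with
    a positive estimated advantage contribute to the loss.
    """
    advantages = [g - v for g, v in zip(returns, values)]
    terms = [max(a, 0.0) * lp for a, lp in zip(advantages, log_probs)]
    return -sum(terms) / len(terms)


def spectral_norm(weight, iters=50):
    """Estimate the largest singular value of a matrix by power iteration."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols  # unit-norm starting vector
    sigma = 0.0
    for _ in range(iters):
        # u = W v, whose norm estimates the top singular value
        u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            return 0.0
        # v = W^T u, renormalized to unit length
        v = [sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        v_norm = math.sqrt(sum(x * x for x in v))
        v = [x / v_norm for x in v]
    return sigma


def spectral_normalize(weight):
    """Rescale a weight matrix so its largest singular value is about 1."""
    sigma = spectral_norm(weight)
    if sigma == 0.0:
        return [row[:] for row in weight]
    return [[w / sigma for w in row] for row in weight]


def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    print(relu_advantage_loss([-1.0, -2.0], [1.0, 0.0], [0.0, 1.0]))
    print(spectral_normalize([[3.0, 0.0], [0.0, 1.0]]))
    print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng))
```

In a full agent these pieces would sit inside an automatic-differentiation framework, with spectral normalization and dropout applied to network layers; the sketch only isolates the arithmetic of each modification.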