While actor-critic methods have had substantial success in solving continuous control, simpler critic-only methods such as Q-learning find limited application in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: stabilisation heuristics, higher compute requirements, and wider hyperparameter search spaces. We show that a simple modification of deep Q-learning largely alleviates these issues. By combining bang-bang action discretization with value decomposition, which frames single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and we demonstrate surprisingly strong performance across a variety of continuous control tasks.
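To make the combination of bang-bang discretization and value decomposition concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the joint value is the mean of per-dimension utilities (a VDN-style choice that the abstract does not specify), and every name in it (`BANG_BANG`, `joint_q`, `greedy_bins`, `to_action`, `td_target`) is hypothetical. The per-dimension utilities are random placeholders standing in for the outputs of a learned critic.

```python
import itertools
import random

# Assumption: each action dimension is restricted to its two extremes.
BANG_BANG = (-1.0, 1.0)


def joint_q(per_dim_q, bins):
    """Mean of the per-dimension utilities selected by `bins` (one index per dimension)."""
    return sum(q[b] for q, b in zip(per_dim_q, bins)) / len(per_dim_q)


def greedy_bins(per_dim_q):
    """Independent argmax per dimension: M small comparisons instead of 2**M."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_dim_q)


def to_action(bins):
    """Map bin indices to a bang-bang continuous action vector."""
    return [BANG_BANG[b] for b in bins]


def td_target(reward, discount, next_per_dim_q):
    """One-step Q-learning target using the decomposed maximum at the next state."""
    best = greedy_bins(next_per_dim_q)
    return reward + discount * joint_q(next_per_dim_q, best)


# Placeholder utilities for one state of a 3-dimensional action space.
rng = random.Random(0)
per_dim_q = [[rng.uniform(-1, 1) for _ in BANG_BANG] for _ in range(3)]

# Under the mean decomposition, the decoupled argmax should agree with a
# brute-force search over all 2**3 joint bang-bang actions.
decoupled = greedy_bins(per_dim_q)
brute_force = max(
    itertools.product(range(len(BANG_BANG)), repeat=3),
    key=lambda bins: joint_q(per_dim_q, bins),
)
print(decoupled == brute_force, to_action(decoupled))
```

Under this assumed decomposition, greedy action selection scales linearly with the number of action dimensions, because each dimension acts like a cooperating agent that maximises its own utility while sharing the same state input.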