Fully decentralized learning, in which global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, because the transition probabilities each agent observes are non-stationary while all agents update their policies simultaneously. To tackle this challenge, we propose the best possible operator, a novel decentralized operator, and prove that the agents' policies converge to the optimal joint policy if each agent independently updates its individual state-action value with this operator. Further, to make the update more efficient and practical, we simplify the operator and prove that convergence and optimality still hold for the simplified version. Instantiating the simplified operator yields a fully decentralized algorithm, best possible Q-learning (BQL), which does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
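To make the idea of agents updating individual values without seeing each other's actions concrete, here is a minimal sketch. It is not the best possible operator or BQL, neither of which is defined in the abstract. It assumes a hypothetical deterministic, single-state cooperative game and uses a simple optimistic rule, in which each agent keeps the largest reward it has ever seen for each of its own actions, in the style of distributed Q-learning for deterministic cooperative tasks. The payoff matrix `R` and the names `train_optimistic` and `greedy` are illustrative assumptions.

```python
import random

# Hypothetical 3x3 cooperative matrix game: both agents receive the same
# deterministic reward R[a1][a2]. The optimal joint action is (0, 0).
R = [[11, -30, 0],
     [-30, 7, 6],
     [0, 0, 5]]


def train_optimistic(steps=500, seed=0):
    """Each agent learns a value per OWN action, never seeing the other's action."""
    rng = random.Random(seed)
    n = len(R)
    q1 = [float("-inf")] * n  # agent 1: values over its own actions
    q2 = [float("-inf")] * n  # agent 2: values over its own actions
    for _ in range(steps):
        # Both agents explore uniformly at random and act simultaneously.
        a1 = rng.randrange(n)
        a2 = rng.randrange(n)
        r = R[a1][a2]  # shared team reward
        # Optimistic update (an assumption of this sketch): keep the best
        # reward ever observed for the chosen individual action.
        q1[a1] = max(q1[a1], r)
        q2[a2] = max(q2[a2], r)
    return q1, q2


def greedy(q):
    """Index of the highest-valued individual action."""
    return max(range(len(q)), key=q.__getitem__)


q1, q2 = train_optimistic()
joint = (greedy(q1), greedy(q2))
print(q1, q2, joint)
```

In this toy setting each agent's value for an action ends up equal to the best reward reachable with that action, so the two greedy choices agree on the optimal joint action even though neither agent observes the other. The rule relies on deterministic rewards and a unique optimum; stochastic environments need more than a plain maximum, which is the kind of setting the paper's operator targets.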