This study proposes a social learning method for estimating the global state within a multi-agent off-policy actor-critic algorithm for reinforcement learning (RL) operating in a partially observable environment. We assume that the network of agents operates in a fully decentralized manner, with each agent able to exchange variables only with its immediate neighbors. The proposed design is supported by an analysis showing that the difference between the final outcomes obtained when the global state is fully observed and when it is estimated through the social learning method is $\varepsilon$-bounded, provided a sufficient number of social learning iterations are performed. Unlike many existing dec-POMDP-based RL approaches, the proposed algorithm requires no knowledge of the transition model and is therefore suitable for model-free multi-agent reinforcement learning. Furthermore, experimental results illustrate the efficacy of the algorithm and demonstrate its superiority over current state-of-the-art methods.
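To make the state-estimation step concrete, the sketch below shows one standard social-learning iteration of the adapt-then-combine form: each agent performs a local Bayesian update of its belief over a finite set of candidate global states using the likelihood of its private observation, then geometrically averages its neighbors' intermediate beliefs. This is a minimal illustration under assumed conventions, not the paper's actual implementation; the function name `social_learning_step`, the finite state space, and the left-stochastic combination matrix `A` are all assumptions for the example.

```python
import numpy as np

# Minimal sketch of one social-learning (adapt-then-combine) belief update
# for decentralized global-state estimation. Names, shapes, and the
# combination rule are illustrative assumptions, not the paper's code.

def social_learning_step(beliefs, likelihoods, A):
    """One adapt-then-combine social learning iteration.

    beliefs:     (N, S) array; row i is agent i's belief over S candidate
                 global states.
    likelihoods: (N, S) array; row i is the likelihood of agent i's private
                 observation under each candidate global state.
    A:           (N, N) left-stochastic combination matrix; A[j, i] > 0 only
                 if agent j is an immediate neighbor of agent i.
    """
    # Adapt: local Bayesian update with the private observation likelihood.
    psi = beliefs * likelihoods
    psi /= psi.sum(axis=1, keepdims=True)

    # Combine: geometric averaging of neighbors' intermediate beliefs,
    # done in the log domain for numerical stability:
    # log mu_i = sum_j A[j, i] * log psi_j
    log_psi = np.log(psi + 1e-12)
    log_mu = A.T @ log_psi
    mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
    mu /= mu.sum(axis=1, keepdims=True)
    return mu
```

In an actor-critic loop of the kind the abstract describes, each agent would presumably run a fixed number of such iterations per environment step, consistent with the $\varepsilon$-bound holding after sufficiently many updates, and feed the resulting belief (or its argmax) to its critic in place of the unobserved global state.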