Most works on multi-agent reinforcement learning focus on scenarios where the state of the environment is fully observable. In this work, we consider a cooperative policy evaluation task in which agents are not assumed to observe the environment state directly. Instead, agents can only have access to noisy observations and to belief vectors. It is well-known that finding global posterior distributions under multi-agent settings is generally NP-hard. As a remedy, we propose a fully decentralized belief forming strategy that relies on individual updates and on localized interactions over a communication network. In addition to the exchange of the beliefs, agents exploit the communication network by exchanging value function parameter estimates as well. We analytically show that the proposed strategy allows information to diffuse over the network, which in turn allows the agents' parameters to have a bounded difference with a centralized baseline. A multi-sensor target tracking application is considered in the simulations.
翻译:大多数关于多智能体强化学习的研究聚焦于环境状态完全可观测的场景。本文考虑一种协作式策略评估任务,其中假设智能体无法直接观测环境状态,仅能获取含噪观测值与信念向量。众所周知,多智能体环境下求解全局后验分布通常为NP难问题。为此,我们提出一种完全分布式的信念形成策略,该策略依赖个体更新与通信网络上的局部交互。除信念交换外,智能体还通过网络通信交换价值函数参数估计值。理论分析表明,所提策略能使信息在网络上扩散,进而使智能体参数与集中式基准之间的差异保持有界。仿真实验中考虑了多传感器目标跟踪应用场景。