This paper proposes an agent-based optimistic policy iteration (OPI) scheme for learning stationary optimal stochastic policies in multi-agent Markov decision processes (MDPs) in which agents incur a Kullback-Leibler (KL) divergence cost for their control efforts and an additional cost on the joint state. The proposed scheme consists of a greedy policy improvement step followed by an m-step temporal difference (TD) policy evaluation step. Exploiting the separable structure of the instantaneous cost, we show that the policy improvement step follows a Boltzmann distribution that depends on the current value function estimate and the uncontrolled transition probabilities, which allows each agent to compute the improved joint policy independently. We show that both the synchronous (full state space evaluation) and asynchronous (evaluation on a uniformly sampled set of substates) versions of the OPI scheme with finite policy evaluation rollouts converge asymptotically to the optimal value function and an optimal joint policy. Simulation results on a KL-control-cost variant of the Stag-Hare game, modeled as a multi-agent MDP, validate our scheme's performance in minimizing the cost return.
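The two alternating steps of the scheme can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the state space size, discount factor, step size, and all variable names (`P0`, `q`, `greedy_improvement`, `td_rollout`) are hypothetical, and the Boltzmann improvement rule `pi(x'|x) ∝ P0(x'|x) exp(-γ V(x'))` is the standard form for KL-control (linearly solvable) MDPs, which the abstract indicates the improvement step takes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5   # hypothetical small joint state space
m = 3          # rollout length for the m-step TD evaluation
alpha = 0.5    # TD step size (illustrative)
gamma = 0.95   # discount factor (illustrative)

# Uncontrolled ("passive") transition probabilities and joint-state cost.
P0 = rng.dirichlet(np.ones(n_states), size=n_states)  # rows sum to 1
q = rng.uniform(0.0, 1.0, size=n_states)

V = np.zeros(n_states)  # current value-function estimate

def greedy_improvement(V):
    """Greedy step: Boltzmann policy pi(x'|x) ∝ P0(x'|x) * exp(-gamma * V(x')).
    Depends only on V and P0, so agents can compute it independently."""
    W = P0 * np.exp(-gamma * V)[None, :]
    return W / W.sum(axis=1, keepdims=True)

def td_rollout(V, pi, x0):
    """m-step TD target from state x0 under policy pi.
    Instantaneous cost = state cost q(x) + KL(pi(.|x) || P0(.|x))."""
    x, target, disc = x0, 0.0, 1.0
    for _ in range(m):
        kl = np.sum(pi[x] * np.log(pi[x] / P0[x]))
        target += disc * (q[x] + kl)
        disc *= gamma
        x = rng.choice(n_states, p=pi[x])
    return target + disc * V[x]

# One synchronous OPI sweep: improve the policy, then TD-update every state.
pi = greedy_improvement(V)
for x0 in range(n_states):
    V[x0] += alpha * (td_rollout(V, pi, x0) - V[x0])
```

The asynchronous variant described in the abstract would replace the full sweep over states with TD updates on a uniformly sampled subset of substates per iteration.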