In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL), where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties, whose distributions are known and unknown, respectively. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is intractable in general. We therefore focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality, in the sense of the Nash equilibrium achieved for the MFTG. This in turn allows us to benchmark the resulting performance against that of the corresponding original robust multi-agent control problem. Then, we propose a Receding-Horizon Gradient Descent-Ascent RL algorithm to find the MFTG Nash equilibrium, and we prove a non-asymptotic rate of convergence for it. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.
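A minimal mathematical sketch may help fix ideas; the notation below (state $x_t$, cooperative control $u_t$, adversarial non-stochastic disturbance $w_t$, stochastic noise $\epsilon_t$ with known distribution, and matrices $A, B, D, Q, R, Q_T$ with attenuation weight $\gamma$) is assumed for illustration and is not fixed by the abstract. In the Linear Quadratic setting, a worst-case formulation of the type described above takes the minimax form
\[
\min_{u}\,\max_{w}\; \mathbb{E}\!\left[\sum_{t=0}^{T-1}\Big(x_t^\top Q\, x_t + u_t^\top R\, u_t - \gamma^2 \|w_t\|^2\Big) + x_T^\top Q_T\, x_T\right],
\qquad
x_{t+1} = A x_t + B u_t + D w_t + \epsilon_t,
\]
where the minimizing player represents the cooperative population and the maximizing player represents the non-stochastic uncertainty. For such a minimax objective $J(u,w)$, a generic gradient descent-ascent scheme with step sizes $\eta_u, \eta_w$ alternates
\[
u^{(k+1)} = u^{(k)} - \eta_u \nabla_u J\big(u^{(k)}, w^{(k)}\big),
\qquad
w^{(k+1)} = w^{(k)} + \eta_w \nabla_w J\big(u^{(k)}, w^{(k)}\big),
\]
which, in the proposed algorithm, is applied in a receding-horizon fashion.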