This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS). These algorithms draw inspiration from foundational concepts in information theory and are provably sample-efficient in MARL settings such as two-player zero-sum Markov games (MGs) and multi-player general-sum MGs. For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning a Nash equilibrium. The basic algorithm, referred to as MAIDS, employs an asymmetric learning structure: the max-player first solves a minimax optimization problem based on the joint information ratio of the joint policy, and the min-player then minimizes the marginal information ratio with the max-player's policy fixed. Theoretical analysis shows that MAIDS achieves a Bayesian regret of $\tilde{O}(\sqrt{K})$ over $K$ episodes. To reduce the computational load of MAIDS, we develop an improved algorithm called Reg-MAIDS, which attains the same Bayesian regret bound with lower computational complexity. Moreover, by leveraging the flexibility of the IDS principle in choosing the learning target, we propose two methods for constructing compressed environments based on rate-distortion theory, upon which we develop an algorithm, Compressed-MAIDS, whose learning target is a compressed environment. Finally, we extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either a Nash equilibrium or a coarse correlated equilibrium in a sample-efficient manner.
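For readers unfamiliar with the information ratio that IDS builds on, the following is a minimal sketch of the generic single-agent version: choose a distribution over actions that minimizes (expected regret)^2 divided by expected information gain. It is not the MAIDS, Reg-MAIDS, or Compressed-MAIDS procedure described above, whose joint and marginal information ratios are defined over policies in a Markov game. The function name `ids_distribution`, the grid search over two-action mixtures, and the numerical inputs are all illustrative assumptions.

```python
import numpy as np


def ids_distribution(regret, info_gain, grid=1001):
    """Hypothetical helper: minimize (expected regret)^2 / (expected info gain).

    Generic single-agent IDS sketch, not the paper's multi-agent algorithm.
    Searches mixtures of at most two actions on a grid of mixing weights.
    Returns (None, inf) if no action has positive information gain.
    """
    regret = np.asarray(regret, dtype=float)
    info_gain = np.asarray(info_gain, dtype=float)
    n = regret.size
    qs = np.linspace(0.0, 1.0, grid)  # weight placed on action i
    best_pi, best_ratio = None, np.inf
    for i in range(n):
        for j in range(i, n):
            # Expected regret and information gain of the (i, j) mixture.
            d = qs * regret[i] + (1.0 - qs) * regret[j]
            g = qs * info_gain[i] + (1.0 - qs) * info_gain[j]
            # The ratio is left at infinity wherever the gain is zero.
            ratio = np.full_like(d, np.inf)
            ok = g > 0
            ratio[ok] = d[ok] ** 2 / g[ok]
            k = int(np.argmin(ratio))
            if ratio[k] < best_ratio:
                best_ratio = float(ratio[k])
                best_pi = np.zeros(n)
                best_pi[i] += qs[k]
                best_pi[j] += 1.0 - qs[k]
    return best_pi, best_ratio


# Made-up inputs: action 0 is low-regret but uninformative,
# action 1 is costly but informative.
pi, ratio = ids_distribution([0.2, 1.0], [0.1, 2.0])
print(pi, ratio)
```

With these made-up inputs, the pure actions have ratios 0.4 and 0.5, while a mixture of the two attains a strictly smaller ratio, which illustrates why IDS-style methods randomize between exploiting and gathering information.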