Stochastic games are a popular framework for studying multi-agent reinforcement learning (MARL). Recent advances in MARL have focused primarily on games with finitely many states. In this work, we study multi-agent learning in stochastic games with general state spaces and an information structure in which agents do not observe each other's actions. In this context, we propose a decentralized MARL algorithm and prove the near-optimality of its policy updates. Furthermore, we study the global policy-updating dynamics for a general class of best-reply-based algorithms and derive a closed-form characterization of convergence probabilities over the joint policy space.