We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior sampling algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) both players use the posterior sampling algorithm, and (ii) only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against its expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order $O(HS\sqrt{ABHK\log(SABHK)})$, where $K$ is the number of episodes, $H$ is the episode length, $S$ is the number of states, and $A,B$ are the sizes of the two players' action spaces. Experiments in a grid-world predator--prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.
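The per-episode step described above (sample a game model from the posterior, then compute an equilibrium policy for it) can be sketched as follows. This is a minimal illustration under assumptions that the abstract does not specify: a Dirichlet posterior over transitions, a Beta posterior over mean rewards in $[0,1]$, and a time-homogeneous model. The function names `dirichlet`, `sample_model`, `solve_matrix_game`, and `equilibrium_policies` are hypothetical. Each stage game is solved only approximately, by fictitious-play iterations used here purely as a dependency-free stand-in for an exact linear-programming solver.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def sample_model(trans_counts, reward_ab, rng):
    """Draw one game model from the (assumed) Dirichlet/Beta posterior.

    trans_counts[s][a][b]: Dirichlet parameters over next states
    reward_ab[s][a][b]:    (alpha, beta) of a Beta posterior on the mean reward
    """
    P = [[[dirichlet(c, rng) for c in row] for row in per_s]
         for per_s in trans_counts]
    R = [[[rng.betavariate(al, be) for (al, be) in row] for row in per_s]
         for per_s in reward_ab]
    return P, R

def solve_matrix_game(M, iters=2000):
    """Approximate equilibrium of a zero-sum matrix game (row player maximizes).

    Runs fictitious play and returns the empirical mixed strategies together
    with the midpoint of the resulting upper and lower bounds on the value.
    """
    n, m = len(M), len(M[0])
    row_counts, col_counts = [0] * n, [0] * m
    row_pay = [0.0] * n  # cumulative payoff of each row vs. column history
    col_pay = [0.0] * m  # cumulative payoff of each column vs. row history
    i, j = 0, 0
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        for r in range(n):
            row_pay[r] += M[r][j]
        for c in range(m):
            col_pay[c] += M[i][c]
        i = max(range(n), key=row_pay.__getitem__)  # best response of row player
        j = min(range(m), key=col_pay.__getitem__)  # best response of column player
    x = [c / iters for c in row_counts]
    y = [c / iters for c in col_counts]
    value = 0.5 * (max(row_pay) + min(col_pay)) / iters
    return x, y, value

def equilibrium_policies(P, R, H):
    """Backward induction over H steps, solving one stage game per state."""
    S = len(P)
    V = [0.0] * S  # value after the final step
    policy = []
    for _ in range(H):
        stage, new_V = [], []
        for s in range(S):
            Q = [[R[s][a][b] + sum(p * v for p, v in zip(P[s][a][b], V))
                  for b in range(len(R[s][a]))]
                 for a in range(len(R[s]))]
            x, y, val = solve_matrix_game(Q)
            stage.append((x, y))
            new_V.append(val)
        policy.append(stage)
        V = new_V
    policy.reverse()  # policy[h][s] = (row strategy, column strategy)
    return policy, V

# One episode's planning step for a toy game with uniform priors.
rng = random.Random(0)
S, A, B, H = 2, 2, 2, 3
trans_counts = [[[[1.0] * S for _ in range(B)] for _ in range(A)] for _ in range(S)]
reward_ab = [[[(1.0, 1.0) for _ in range(B)] for _ in range(A)] for _ in range(S)]
P, R = sample_model(trans_counts, reward_ab, rng)
policy, V0 = equilibrium_policies(P, R, H)
```

In a full learning loop, the agent would play its half of `policy` for one episode, add the observed transitions and rewards to `trans_counts` and `reward_ab`, and resample at the start of the next episode. In setting (ii), the opponent's actions would come from its own algorithm rather than from the sampled equilibrium.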