We consider a cooperative multiplayer bandit learning problem in which the players may agree on a strategy beforehand but cannot communicate during the learning process. In each round, all players simultaneously select an action, and the team receives a reward determined by the joint action. The actions of all players are commonly observed. However, each player receives only a noisy version of the reward, which cannot be shared with the other players. Because players may receive different rewards, the information they use to select their actions is asymmetric. In this paper, we provide an algorithm based on upper and lower confidence bounds that the players can use to select their optimal actions despite this asymmetry in the reward information. We show that this algorithm achieves logarithmic $O(\frac{\log T}{\Delta_{\bm{a}}})$ (gap-dependent) regret as well as $O(\sqrt{T\log T})$ (gap-independent) regret. This is asymptotically optimal in $T$. We also show that it performs empirically better than the current state-of-the-art algorithm for this environment.
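To make the role of upper and lower confidence bounds concrete, the following is a minimal sketch of how a single player might maintain confidence intervals over joint actions from its own private, noisy rewards. It is not the algorithm of this paper, whose details the abstract does not give. The function names `confidence_bounds` and `surviving_actions`, the Hoeffding-style radius $\sqrt{2\log t / n}$, and the elimination rule are all assumptions chosen for illustration.

```python
import math


def confidence_bounds(reward_sum, pulls, t):
    """Return (lcb, ucb) for one joint action from one player's private rewards.

    Assumption: a Hoeffding-style radius sqrt(2 log t / n), as in standard UCB.
    """
    if pulls == 0:
        # No observations yet: the interval is uninformative.
        return (-math.inf, math.inf)
    mean = reward_sum / pulls
    radius = math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    return (mean - radius, mean + radius)


def surviving_actions(stats, t):
    """Keep the joint actions whose UCB is at least the largest LCB.

    `stats` maps a joint action (a tuple with one entry per player) to
    (sum of observed rewards, number of times the joint action was played).
    """
    bounds = {a: confidence_bounds(s, n, t) for a, (s, n) in stats.items()}
    best_lcb = max(lcb for lcb, _ in bounds.values())
    return sorted(a for a, (_, ucb) in bounds.items() if ucb >= best_lcb)


if __name__ == "__main__":
    # Hypothetical statistics for a two-player problem with two actions each.
    stats = {(0, 0): (90.0, 100), (0, 1): (10.0, 100), (1, 0): (0.0, 0)}
    print(surviving_actions(stats, t=200))
```

In this sketch, a joint action is discarded only when its upper bound falls below another action's lower bound. Because each player sees different noise, two players running such a rule on their own rewards could disagree about which actions survive; this is the coordination difficulty that the asymmetric-information setting introduces.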