We consider a decentralized multi-agent Multi-Armed Bandit (MAB) setting with $N$ agents, each solving the same MAB instance to minimize its individual cumulative regret. In our model, agents collaborate by exchanging messages through pairwise gossip-style communications over an arbitrary connected graph. We develop two novel algorithms in which each agent plays only from a subset of all the arms. Agents use the communication medium to recommend only arm-IDs (not samples), and thereby update the set of arms from which they play. We establish that if agents communicate $\Omega(\log T)$ times through any connected pairwise gossip mechanism, then every agent's regret is smaller by a factor of order $N$ compared to the case of no collaboration. Furthermore, we show that the communication constraints have only a second-order effect on the regret of our algorithms. We then analyze this second-order term of the regret to derive bounds on the regret-communication tradeoff. Finally, we empirically evaluate our algorithms and conclude that the insights are fundamental and not artifacts of our bounds. We also prove a lower bound showing that the regret scaling achieved by our algorithms cannot be improved even in the absence of communication constraints. Our results thus demonstrate that even a minimal level of collaboration among agents greatly reduces regret for all agents.
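To make the setup concrete, the following is a minimal simulation sketch of the kind of protocol described above, not the paper's exact algorithms: each agent runs UCB1 on a small active arm set and, at roughly doubling times (so $O(\log T)$ communication rounds), asks a uniformly random gossip partner for its empirically best arm-ID and swaps it into its active set. All parameter choices (`N`, `K`, initial set size, the complete-graph gossip) are illustrative assumptions.

```python
import math
import random

def gossip_bandit_sim(N=4, K=8, T=2000, seed=0):
    """Illustrative sketch: N agents, K Bernoulli arms, each agent plays
    UCB1 over its own small active set; at log-spaced times it receives
    a single arm-ID recommendation from a random gossip partner."""
    rng = random.Random(seed)
    means = [0.2 + 0.6 * a / (K - 1) for a in range(K)]  # arm K-1 is best
    # Each agent starts with a random subset of arms (size is an assumption).
    set_size = math.ceil(K / N) + 1
    active = [set(rng.sample(range(K), set_size)) for _ in range(N)]
    counts = [[0] * K for _ in range(N)]
    sums = [[0.0] * K for _ in range(N)]
    next_comm = 8  # gossip at doubling times -> O(log T) rounds in total

    for t in range(1, T + 1):
        for i in range(N):
            def ucb(a):  # UCB1 index, restricted to the agent's active set
                if counts[i][a] == 0:
                    return float('inf')
                mean = sums[i][a] / counts[i][a]
                return mean + math.sqrt(2 * math.log(t) / counts[i][a])
            arm = max(active[i], key=ucb)
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[i][arm] += 1
            sums[i][arm] += reward
        if t >= next_comm:
            next_comm *= 2
            for i in range(N):
                j = rng.randrange(N)  # uniform gossip partner (complete graph)
                emp = lambda ag, a: sums[ag][a] / max(counts[ag][a], 1)
                best_j = max(active[j], key=lambda a: emp(j, a))
                if best_j not in active[i]:
                    # Only the arm-ID is communicated, never samples:
                    # replace the empirically worst local arm with it.
                    worst = min(active[i], key=lambda a: emp(i, a))
                    active[i].discard(worst)
                    active[i].add(best_j)
    return active, means
```

Because only arm-IDs travel over the gossip channel, each message is tiny, yet good arms spread through the network and each agent's per-round exploration burden stays proportional to its small active set.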