Competition for shareable and limited resources has long been studied with strategic agents. In reality, agents often have to learn the rewards of the resources and maximize them at the same time. To design an individualized competing policy, we model the competition between agents in a novel multi-player multi-armed bandit (MPMAB) setting where players are selfish and aim to maximize their own rewards. In addition, when several players pull the same arm, we assume that these players share the arm's reward equally in expectation. Under this setting, we first analyze the Nash equilibrium when the arms' rewards are known. Subsequently, we propose a novel SelfishMPMAB with Averaging Allocation (SMAA) approach based on the equilibrium. We theoretically demonstrate that SMAA achieves a good regret guarantee for each player when all players follow the algorithm. Additionally, we establish that no single selfish player can significantly increase their rewards through deviation, nor can they detrimentally affect other players' rewards without incurring substantial losses for themselves. Finally, we validate the effectiveness of the method in extensive synthetic experiments.
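The known-reward setting described above can be illustrated with a minimal sketch. It is not the SMAA algorithm and not necessarily the equilibrium analysis used in the paper. It only assumes that an arm with expected reward `mu[k]` pulled by `n` players yields `mu[k] / n` to each of them in expectation, and it uses a standard sequential greedy assignment to build one pure Nash equilibrium. The function names `greedy_equilibrium` and `is_nash` and the example rewards are hypothetical.

```python
from typing import List


def greedy_equilibrium(mu: List[float], n_players: int) -> List[int]:
    """Assign players one at a time to the arm with the best per-player share.

    Assumption (illustrative): each of the n players on arm k receives
    mu[k] / n in expectation. Returns the number of players on each arm.
    """
    counts = [0] * len(mu)
    for _ in range(n_players):
        # Share the next player would get on each arm if they joined it.
        best = max(range(len(mu)), key=lambda k: mu[k] / (counts[k] + 1))
        counts[best] += 1
    return counts


def is_nash(mu: List[float], counts: List[int], tol: float = 1e-12) -> bool:
    """Check that no player gains by unilaterally switching arms."""
    for j, n_j in enumerate(counts):
        if n_j == 0:
            continue  # no player on this arm, so nobody can deviate from it
        current_share = mu[j] / n_j
        for k in range(len(mu)):
            # Share after moving from arm j to arm k (arm k gains one player).
            if k != j and mu[k] / (counts[k] + 1) > current_share + tol:
                return False
    return True


if __name__ == "__main__":
    mu = [0.9, 0.5, 0.2]  # hypothetical known expected rewards
    counts = greedy_equilibrium(mu, n_players=3)
    print(counts, is_nash(mu, counts))
```

With these example rewards, two players share the best arm (0.45 each) and one takes the second arm (0.5). No player can improve by switching, whereas crowding all three players onto the best arm would not be an equilibrium.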