We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.
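To make the setting concrete, the following is a minimal simulation sketch of one plausible memoryless, time-independent gossip rule, not the protocols analyzed in the paper: each agent pulls its current arm, observes a Bernoulli reward, then meets a uniformly random peer and copies the peer's arm whenever the peer's observed reward beat its own. All names (`gossip_bandit_round`, `simulate`) and the specific update rule are illustrative assumptions.

```python
import random

def gossip_bandit_round(actions, means, rng):
    """One round of an illustrative (hypothetical) memoryless gossip protocol:
    pull, then copy a random peer's arm if its observed reward was higher."""
    n = len(actions)
    # Each agent observes a Bernoulli reward from its current arm.
    rewards = [1 if rng.random() < means[a] else 0 for a in actions]
    new_actions = list(actions)
    for i in range(n):
        j = rng.randrange(n)  # uniformly random gossip partner
        # Memoryless, time-independent update: adopt the partner's arm
        # when the partner's observed reward beats our own.
        if rewards[j] > rewards[i]:
            new_actions[i] = actions[j]
    return new_actions

def simulate(n=1000, means=(0.3, 0.5, 0.9), rounds=200, seed=0):
    """Run the sketch and return the fraction of agents on the best arm."""
    rng = random.Random(seed)
    actions = [rng.randrange(len(means)) for _ in range(n)]
    for _ in range(rounds):
        actions = gossip_bandit_round(actions, means, rng)
    best = max(range(len(means)), key=lambda a: means[a])
    return sum(1 for a in actions if a == best) / n
```

Under this toy rule, the population drifts toward the highest-mean arm, mirroring the consensus behavior described above, though the paper's actual protocols and their $\widetilde O(\sqrt{n})$ consensus guarantee are established analytically via the multiplicative-weights connection.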