We study a distributed multi-armed bandit setting among a population of $n$ memory-constrained nodes in the gossip model: at each round, every node locally adopts one of $m$ arms, observes a reward drawn from the arm's (adversarially chosen) distribution, and then communicates with a randomly sampled neighbor, exchanging information to determine its policy in the next round. We introduce and analyze several families of decentralized dynamics for this task: each node's decision is entirely local and depends only on its most recently obtained reward and that of the neighbor it sampled. We show a connection between the global evolution of these decentralized dynamics and a certain class of "zero-sum" multiplicative weight update algorithms, and we develop a general framework for analyzing the population-level regret of these natural protocols. Using this framework, we derive sublinear regret bounds under a wide range of parameter regimes (i.e., population sizes and numbers of arms) for both the stationary reward setting (where the mean of each arm's distribution is fixed over time) and the adversarial reward setting (where the means can vary over time). Further, we show that these protocols can approximately optimize convex functions over the simplex when the reward distributions are generated from a stochastic gradient oracle.
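To make the round structure concrete, the following is a minimal sketch of one possible gossip dynamic. It is not one of the protocols analyzed here. It makes several simplifying assumptions: the communication graph is complete, rewards are Bernoulli with fixed means (the stationary setting), and the hypothetical local rule is "copy the sampled neighbor's arm if its reward was strictly higher than mine". The function name `simulate_gossip_bandit` and its parameters are illustrative only.

```python
import random


def simulate_gossip_bandit(n, means, rounds, seed=0):
    """Toy gossip dynamic on a complete graph (illustrative assumption).

    n      -- number of nodes (must be at least 2)
    means  -- Bernoulli mean reward of each of the m arms (stationary setting)
    rounds -- number of synchronous rounds to simulate
    Returns, for each round, the fraction of nodes holding each arm.
    """
    rng = random.Random(seed)
    m = len(means)
    # Round-robin initialization so every arm is represented at the start.
    arms = [i % m for i in range(n)]
    history = []
    for _ in range(rounds):
        # Every node pulls its current arm and observes a Bernoulli reward.
        rewards = [1 if rng.random() < means[a] else 0 for a in arms]
        new_arms = list(arms)
        for i in range(n):
            # Sample one other node uniformly at random (complete graph).
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Hypothetical local rule: the decision uses only node i's own
            # latest reward and the reward of the neighbor it sampled.
            if rewards[j] > rewards[i]:
                new_arms[i] = arms[j]
        # All nodes update simultaneously at the end of the round.
        arms = new_arms
        history.append([arms.count(k) / n for k in range(m)])
    return history


if __name__ == "__main__":
    hist = simulate_gossip_bandit(n=200, means=[0.3, 0.5, 0.8], rounds=100)
    print("final arm distribution:", hist[-1])
```

Each node stores only its current arm and latest reward, which matches the memory-constrained, purely local flavor of the setting. The vector of arm fractions returned per round is the population-level distribution over arms, the quantity whose evolution is related above to multiplicative weight updates. Under this toy rule an arm that loses all its adopters can never return, a limitation the sketch does not attempt to address.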