We study regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, $\coopse$, and show an individual regret bound of ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$, together with a nearly matching lower bound. Here $A$ is the number of actions, $T$ is the time horizon, $m$ is the number of agents, and $\mathcal{R} = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single-agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. Communication networks raise considerations beyond regret, such as message size and the number of communication rounds. First, we show that our regret bound holds even if messages are restricted to logarithmic size. Second, with a logarithmic number of communication rounds, we obtain a regret bound of ${O}(\mathcal{R} / m+A \log T)$.
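To make the quantities in the bound concrete, the following minimal sketch evaluates $\mathcal{R}$ and the three terms of $\mathcal{R}/m + A^2 + A\sqrt{\log T}$ for a given set of gaps. It is an illustration only, not the $\coopse$ algorithm: the function names are hypothetical, and every constant hidden by the $O(\cdot)$ notation is assumed to be 1, so the numbers indicate relative scale rather than actual regret.

```python
import math


def single_agent_regret(gaps, horizon):
    """R = sum over actions with positive gap of log(T) / gap.

    `gaps` holds the sub-optimality gap of every action (0 for an optimal one).
    """
    return sum(math.log(horizon) / gap for gap in gaps if gap > 0)


def individual_regret_terms(gaps, horizon, num_agents):
    """Terms of R/m + A^2 + A*sqrt(log T).

    Assumption: all constants hidden in the O(.) notation are set to 1.
    """
    num_actions = len(gaps)
    r = single_agent_regret(gaps, horizon)
    return {
        "shared": r / num_agents,           # R / m
        "quadratic": num_actions ** 2,      # A^2
        "sqrt_log": num_actions * math.sqrt(math.log(horizon)),  # A * sqrt(log T)
    }


if __name__ == "__main__":
    # Hypothetical instance: 3 actions, one of them optimal, shared by 3 agents.
    terms = individual_regret_terms([0.0, 0.5, 0.25], horizon=10_000, num_agents=3)
    for name, value in terms.items():
        print(f"{name}: {value:.2f}")
```

Only the first term shrinks as $m$ grows; the $A^2$ and $A\sqrt{\log T}$ terms do not depend on the number of agents or on the graph.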