We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, COOP-SE, and show an individual regret bound of $O(R/ m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ is the time horizon, $m$ is the number of agents, and $R = \sum_{Δ_i > 0}\log(T)/Δ_i$ is the optimal single-agent regret, where $Δ_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. Communication networks raise additional considerations beyond regret, such as message size and the number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for a logarithmic number of communication rounds, we obtain a regret bound of $O(R / m+A \log T)$.
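To make the quantities in the bounds concrete, the following minimal Python sketch evaluates $R$ and the individual terms of both bounds for a given set of gaps. It is illustrative only: the function names and the example gaps are hypothetical, the constants hidden by the $O(\cdot)$ notation are ignored, and it does not implement COOP-SE itself.

```python
import math


def single_agent_regret(gaps, horizon):
    """R = sum over sub-optimal actions (gap > 0) of log(T) / gap."""
    return sum(math.log(horizon) / g for g in gaps if g > 0)


def individual_bound_terms(gaps, horizon, num_agents):
    """Terms of R/m + A^2 + A*sqrt(log T), with O(.) constants ignored."""
    num_actions = len(gaps)
    r = single_agent_regret(gaps, horizon)
    return (
        r / num_agents,
        num_actions ** 2,
        num_actions * math.sqrt(math.log(horizon)),
    )


def log_rounds_bound_terms(gaps, horizon, num_agents):
    """Terms of R/m + A*log T (logarithmic number of communication rounds)."""
    num_actions = len(gaps)
    r = single_agent_regret(gaps, horizon)
    return (r / num_agents, num_actions * math.log(horizon))


if __name__ == "__main__":
    # Hypothetical instance: one optimal action (gap 0) and three sub-optimal ones.
    example_gaps = [0.0, 0.1, 0.2, 0.5]
    print(individual_bound_terms(example_gaps, horizon=10**6, num_agents=10))
    print(log_rounds_bound_terms(example_gaps, horizon=10**6, num_agents=10))
```

In this sketch only the $R/m$ term shrinks as the number of agents grows, while the remaining terms depend only on $A$ and $T$.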