We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).
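To make the locality claim concrete, the following minimal sketch computes $d_{\text{mal}}(i)$ on two small graphs and evaluates the leading term $(d_{\text{mal}}(i) + K/n)\log(T)/\Delta$ with all constants dropped. The helper names, the graph sizes, and the placement of the malicious agents are illustrative assumptions, not part of the proposed algorithm, and "line graph" is read here as a path graph.

```python
import math

def malicious_degree(adjacency, malicious, i):
    """Count the neighbors of agent i that are malicious, i.e. d_mal(i)."""
    return sum(1 for j in adjacency[i] if j in malicious)

def regret_scale(d_mal, K, n, T, gap):
    """Leading term (d_mal + K/n) * log(T) / gap; constants are dropped."""
    return (d_mal + K / n) * math.log(T) / gap

def complete_graph(num_nodes):
    """Adjacency lists of the complete graph on num_nodes nodes."""
    return {i: [j for j in range(num_nodes) if j != i] for i in range(num_nodes)}

def path_graph(num_nodes):
    """Adjacency lists of the undirected path 0 - 1 - ... - (num_nodes - 1)."""
    return {
        i: [j for j in (i - 1, i + 1) if 0 <= j < num_nodes]
        for i in range(num_nodes)
    }

# Hypothetical instance: n = 8 honest agents (ids 0..7), m = 2 malicious (ids 8, 9).
n, m = 8, 2
malicious = {8, 9}
complete = complete_graph(n + m)
path = path_graph(n + m)

for name, graph in (("complete", complete), ("path", path)):
    degrees = [malicious_degree(graph, malicious, i) for i in range(n)]
    print(name, degrees)
```

On the complete graph every honest agent has $d_{\text{mal}}(i) = m$, so the term matches the existing bound; on the path, only the honest agent adjacent to a malicious one has a nonzero $d_{\text{mal}}(i)$.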