In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arriving at a node, each agent observes a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture how the reward is transformed when multiple agents sample the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(\gamma N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$ and $\gamma$ is a boundedness parameter associated with the weight functions. Lastly, we evaluate our algorithm numerically by comparing it to alternative methods.
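To make the reward model concrete, the following is a minimal sketch of the system reward for a single time step, assuming that the weight applied to an agent's reward depends only on how many agents occupy the same node. The names `system_reward` and `split_evenly`, and the even-split weight itself, are hypothetical illustrations and are not taken from the paper, which leaves the weight functions general.

```python
from collections import Counter
from typing import Callable, Sequence


def system_reward(
    positions: Sequence[int],
    observed: Sequence[float],
    weight: Callable[[int], float],
) -> float:
    """Weighted sum of the per-agent rewards for one time step.

    positions[i] is the node occupied by agent i and observed[i] is the
    reward that agent i observed there. ASSUMPTION: the weight depends only
    on the number of agents sharing the node.
    """
    occupancy = Counter(positions)  # number of agents at each node
    return sum(
        weight(occupancy[node]) * reward
        for node, reward in zip(positions, observed)
    )


def split_evenly(agents_at_node: int) -> float:
    """Hypothetical weight: agents sharing a node split its reward evenly."""
    return 1.0 / agents_at_node


# Three agents: two share node 0, one is alone at node 2.
positions = [0, 0, 2]
observed = [0.5, 0.5, 0.25]
total = system_reward(positions, observed, split_evenly)
print(total)  # 0.5 * 0.5 + 0.5 * 0.5 + 1.0 * 0.25 = 0.75
```

Under this illustrative weight, sending a second agent to an already occupied node adds nothing in expectation, which is one way a weight function could discourage agents from crowding the same node. Other choices of `weight` would encode different interactions.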