In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arriving at a node, each agent observes a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture the decreasing marginal reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$. Finally, we evaluate the algorithm numerically by comparing it to alternative methods.
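To make the reward model concrete, the following is a minimal sketch of how a system reward with decreasing marginal weights could be computed for one time step. The function name `system_reward`, the argument names, and the specific weighting rule (the $j$-th agent at a node contributes its observation scaled by a non-increasing weight) are illustrative assumptions, not the exact definition used in the paper.

```python
from collections import Counter


def system_reward(positions, observed, marginal_weights):
    """Weighted sum of per-agent rewards at one time step.

    positions[i]        -- node currently occupied by agent i
    observed[i]         -- reward agent i observed at that node
    marginal_weights[j] -- assumed weight for the (j+1)-th agent sharing
                           a node; expected to be non-increasing in j
    """
    agents_seen = Counter()  # how many agents have been counted per node
    total = 0.0
    for node, reward in zip(positions, observed):
        rank = agents_seen[node]  # 0 for the first agent at this node
        total += marginal_weights[rank] * reward
        agents_seen[node] += 1
    return total


# Hypothetical example: agents 0 and 1 share node 0, agent 2 is alone at node 1.
example = system_reward([0, 0, 1], [1.0, 1.0, 2.0], [1.0, 0.5, 0.25])
```

Under these assumed weights, the second agent at node 0 contributes only half of its observation, which illustrates why spreading agents across nodes can be preferable to co-locating them.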