We study a structured multi-agent multi-armed bandit (MAMAB) problem in a dynamic environment. A graph encodes the information-sharing structure among the agents, and the arms' reward distributions are piecewise-stationary, with several unknown change points. All agents face the same piecewise-stationary MAB problem. The goal is to develop a decision-making policy for the agents that minimizes the regret, defined as the expected total loss from not playing the optimal arm at each time step. Our proposed solution, Restarted Bayesian Online Change Point Detection in Cooperative Upper Confidence Bound Algorithm (RBO-Coop-UCB), uses an efficient multi-agent UCB algorithm at its core, enhanced with a Bayesian change point detector. We also develop a simple cooperation mechanism for restart decisions that improves decision-making. Theoretically, we establish that the expected group regret of RBO-Coop-UCB is upper bounded by $\mathcal{O}(KNM\log T + K\sqrt{MT\log T})$, where $K$ is the number of agents, $M$ is the number of arms, and $T$ is the number of time steps. Numerical experiments on synthetic and real-world datasets demonstrate that our proposed method outperforms state-of-the-art algorithms.
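To make the notions of regret and restart concrete, the following is a minimal single-agent sketch, not the RBO-Coop-UCB algorithm itself. It runs standard UCB1 on a piecewise-stationary Bernoulli bandit and resets its statistics at caller-supplied restart times. Several simplifications are assumptions made here for illustration: the restart times are given by hand rather than produced by a Bayesian change point detector, there is no cooperation between agents, all segments have equal length, and the names `ucb1_with_restarts`, `means_by_segment`, `segment_length`, and `restart_times` are hypothetical.

```python
import math
import random


def ucb1_with_restarts(means_by_segment, segment_length, restart_times, seed=0):
    """Run UCB1 on a piecewise-stationary Bernoulli bandit and return the
    cumulative pseudo-regret (sum of gaps to the best arm of each step).

    means_by_segment: one list of arm means per stationary segment.
    restart_times: time steps at which all statistics are reset. They are
    supplied by the caller here (an assumption); a change point detector
    would normally decide them online.
    """
    rng = random.Random(seed)
    num_arms = len(means_by_segment[0])
    horizon = segment_length * len(means_by_segment)
    counts = [0] * num_arms      # pulls per arm since the last restart
    sums = [0.0] * num_arms      # reward sum per arm since the last restart
    steps_since_restart = 0
    regret = 0.0

    for t in range(horizon):
        if t in restart_times:
            # Forget everything learned before the (assumed) change point.
            counts = [0] * num_arms
            sums = [0.0] * num_arms
            steps_since_restart = 0

        means = means_by_segment[t // segment_length]
        steps_since_restart += 1

        # Play every arm once after a (re)start, then maximize the UCB index.
        untried = [a for a in range(num_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(
                range(num_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(steps_since_restart) / counts[a]),
            )

        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]

    return regret


# Two arms whose means swap at t = 2000 (illustrative values only).
segments = [[0.9, 0.1], [0.1, 0.9]]
regret_no_restart = ucb1_with_restarts(segments, 2000, restart_times=set())
regret_oracle_restart = ucb1_with_restarts(segments, 2000, restart_times={2000})
print(regret_no_restart, regret_oracle_restart)
```

Comparing the two printed values for different seeds and segment configurations gives a rough feel for how much a correctly timed restart matters. The sketch makes no claim about the bound stated above, which concerns the multi-agent algorithm with a learned detector.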