We consider regret minimization in a general collaborative multi-agent multi-armed bandit model, in which each agent faces a finite set of arms and may communicate with other agents through a central controller. The optimal arm for each agent in this model is the arm with the largest expected mixed reward, where the mixed reward of an arm is a weighted average of that arm's rewards across all agents, which makes communication among agents crucial. While near-optimal sample complexities for best arm identification are known under this collaborative model, the question of optimal regret remains open. In this work, we address this problem and propose the first algorithm with order-optimal regret bounds under this collaborative bandit model. Furthermore, we show that the algorithm needs only a small constant number of communication rounds in expectation.
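To make the notion of mixed reward concrete, the following is a minimal sketch of how each agent's optimal arm could be computed if the expected rewards were known. It assumes that each agent m has its own weight vector over agents (a row of a hypothetical `weights` matrix whose entries sum to 1); the abstract states only that the mixed reward is a weighted average, so this exact form, the function names, and the numbers are illustrative assumptions, not the paper's algorithm.

```python
def mixed_means(local_means, weights):
    """Return mixed[m][k] = sum_j weights[m][j] * local_means[j][k].

    local_means[j][k]: expected reward of arm k at agent j (assumed known here).
    weights[m][j]: weight agent m places on agent j (assumed form; rows sum to 1).
    """
    n_agents = len(local_means)
    n_arms = len(local_means[0])
    return [
        [sum(weights[m][j] * local_means[j][k] for j in range(n_agents))
         for k in range(n_arms)]
        for m in range(n_agents)
    ]


def optimal_arms(local_means, weights):
    """For each agent, return the index of the arm with the largest mixed mean."""
    mixed = mixed_means(local_means, weights)
    return [max(range(len(row)), key=row.__getitem__) for row in mixed]


# Illustrative numbers only: two agents, two arms.
local_means = [[0.9, 0.1],   # agent 0 sees arm 0 as better locally
               [0.2, 0.8]]   # agent 1 sees arm 1 as better locally
weights = [[0.3, 0.7],       # agent 0 weights agent 1's rewards heavily
           [0.1, 0.9]]

mixed = mixed_means(local_means, weights)
best = optimal_arms(local_means, weights)
```

In this toy instance, agent 0's locally best arm is arm 0, but its best arm under the mixed reward is arm 1, because most of its weight falls on agent 1's rewards. An agent that observes only its own rewards cannot determine this, which illustrates why communication matters in the model.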