We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm \citep{verstraeten2020multiagent} and derived a Bayesian regret bound. However, it remains an open problem how to derive a frequentist regret bound for Thompson sampling in this multi-agent setting. To address this problem, we propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm, which performs MATS exploration with probability $\epsilon$ and follows a greedy policy otherwise. We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies that our frequentist regret upper bound is optimal up to constant and logarithmic factors when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate that $\epsilon$-MATS achieves better performance and computational efficiency than existing algorithms in the same setting.
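As an illustration of the selection rule described above, the following is a minimal Python sketch and not the authors' implementation. It makes several assumptions that the abstract does not state: the exploration coin is flipped once per round, local rewards are treated as unit-variance Gaussians so that a posterior sample is drawn from $\mathcal{N}(\hat{\mu}, 1/(n+1))$, and the joint arm is found by brute-force enumeration, which is feasible only for tiny problems. The names `select_joint_arm` and `update` are hypothetical.

```python
import itertools
import random


def select_joint_arm(groups, num_arms, means, counts, epsilon, rng):
    """Pick a joint arm: posterior samples with prob. epsilon, else empirical means.

    groups   : list of tuples of agent indices (the hyperedges)
    means    : per-group dict mapping a local joint arm to its empirical mean
    counts   : per-group dict mapping a local joint arm to its pull count
    """
    # Assumption: a single coin flip per round decides explore vs. greedy.
    explore = rng.random() < epsilon
    num_agents = 1 + max(a for g in groups for a in g)

    def local_value(e, local_arm):
        mu = means[e].get(local_arm, 0.0)
        n = counts[e].get(local_arm, 0)
        if explore:
            # Assumed Gaussian posterior with variance 1 / (n + 1).
            return rng.gauss(mu, 1.0 / (n + 1) ** 0.5)
        return mu

    # Draw each (group, local arm) value at most once per round.
    cache = {}
    best_arm, best_val = None, float("-inf")
    # Brute-force search over joint arms (illustrative only).
    for joint in itertools.product(range(num_arms), repeat=num_agents):
        total = 0.0
        for e, g in enumerate(groups):
            local = tuple(joint[a] for a in g)
            key = (e, local)
            if key not in cache:
                cache[key] = local_value(e, local)
            total += cache[key]
        if total > best_val:
            best_arm, best_val = joint, total
    return best_arm


def update(groups, means, counts, joint, local_rewards):
    """Incrementally update the empirical mean of each pulled local arm."""
    for e, g in enumerate(groups):
        local = tuple(joint[a] for a in g)
        n = counts[e].get(local, 0) + 1
        mu = means[e].get(local, 0.0)
        counts[e][local] = n
        means[e][local] = mu + (local_rewards[e] - mu) / n
```

With `epsilon = 0` the sketch reduces to a purely greedy policy, and with `epsilon = 1` it samples every round in the style of Thompson sampling. Enumerating joint arms costs time exponential in the number of agents, so a practical implementation would need a maximization routine that exploits the hypergraph structure instead.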