We study a decentralized multi-agent multi-armed bandit problem in which multiple clients are connected by time-dependent random graphs provided by an environment. The reward distribution of each arm varies across clients, and rewards are generated independently over time by the environment from distributions that may be sub-exponential or sub-Gaussian. In each round, every client pulls an arm and communicates with its neighbors in the graph provided by the environment. The goal is to minimize the overall regret of the entire system through collaboration. To this end, we introduce a novel algorithmic framework that first provides robust simulation methods for generating random graphs, using either rapidly mixing Markov chains or the random graph model, and then combines an averaging-based consensus approach with a newly proposed weighting technique and the upper confidence bound (UCB) to deliver a UCB-type solution. Our algorithms account for the randomness in the graphs, remove the conventional doubly stochastic assumption, and require only knowledge of the number of clients at initialization. We derive optimal instance-dependent regret upper bounds of order $\log T$ in both sub-Gaussian and sub-exponential environments, and a mean-gap-independent regret upper bound of order $\sqrt{T}\log T$, which is optimal up to a $\log T$ factor. Importantly, our regret bounds hold with high probability and capture graph randomness, whereas prior works consider expected regret under additional assumptions and require more stringent reward distributions.
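To make the setting concrete, the following is a minimal Python sketch of a decentralized UCB-type loop over random graphs. It is not the algorithm proposed in this work: the Erdős–Rényi graph sampler, the Gaussian reward noise, the UCB1-style confidence radius $\sqrt{2\log t / n}$, and the "cache each neighbor's latest local average and take a plain mean" consensus step are all assumptions chosen for illustration, and the names `sample_erdos_renyi` and `run_sketch` are hypothetical. In particular, the plain mean stands in for the weighting technique mentioned above, whose details the abstract does not give.

```python
import math
import random


def sample_erdos_renyi(num_clients, edge_prob, rng):
    """Sample one undirected Erdos-Renyi graph; return a list of neighbor sets."""
    neighbors = [set() for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            if rng.random() < edge_prob:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def run_sketch(means, horizon, edge_prob=0.5, seed=0):
    """Illustrative decentralized UCB; means[m][k] is client m's mean for arm k.

    Returns (cumulative system regret, per-client pull counts).
    """
    rng = random.Random(seed)
    num_clients, num_arms = len(means), len(means[0])
    # The global mean of an arm is the average of its client-specific means.
    global_means = [sum(means[m][k] for m in range(num_clients)) / num_clients
                    for k in range(num_arms)]
    best = max(global_means)

    pulls = [[0] * num_arms for _ in range(num_clients)]
    local_avg = [[0.0] * num_arms for _ in range(num_clients)]
    # known[m][j][k]: latest local average of arm k that client m has from client j.
    known = [[[None] * num_arms for _ in range(num_clients)]
             for _ in range(num_clients)]
    regret = 0.0

    def estimate(m, k):
        # Plain mean over cached values (assumed stand-in for the paper's weights).
        vals = [known[m][j][k] for j in range(num_clients)
                if known[m][j][k] is not None]
        return sum(vals) / len(vals)

    for t in range(1, horizon + 1):
        # The environment draws a fresh random graph every round.
        neighbors = sample_erdos_renyi(num_clients, edge_prob, rng)
        for m in range(num_clients):
            if t <= num_arms:
                arm = t - 1  # initial round-robin so every arm is pulled once
            else:
                arm = max(
                    range(num_arms),
                    key=lambda k: estimate(m, k)
                    + math.sqrt(2.0 * math.log(t) / pulls[m][k]),
                )
            reward = rng.gauss(means[m][arm], 1.0)  # sub-Gaussian noise (assumed)
            pulls[m][arm] += 1
            local_avg[m][arm] += (reward - local_avg[m][arm]) / pulls[m][arm]
            known[m][m][arm] = local_avg[m][arm]
            regret += best - global_means[arm]
        # Communication: cache the current local averages of this round's neighbors.
        for m in range(num_clients):
            for j in neighbors[m]:
                for k in range(num_arms):
                    if pulls[j][k] > 0:
                        known[m][j][k] = local_avg[j][k]
    return regret, pulls
```

Calling `run_sketch(means, horizon=3000)` with a small matrix of client-specific means returns the cumulative regret measured against the globally best arm, together with each client's pull counts. Because the cached values can be stale and the confidence radius uses only a client's own pull count, this sketch explores more than necessary and carries none of the guarantees stated above.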