We study the problem of best-arm identification in a distributed variant of the multi-armed bandit setting, with a central learner and multiple agents. Each agent is associated with one arm of the bandit and generates stochastic rewards drawn from an unknown distribution. Each agent can communicate its observed rewards to the learner only over a bit-constrained channel. We propose a novel quantization scheme called Inflating Confidence for Quantization (ICQ) that can be applied to existing confidence-bound-based learning algorithms such as Successive Elimination (SE). We analyze the performance of ICQ applied to Successive Elimination and show that the resulting algorithm, named ICQ-SE, has the same order-optimal sample complexity as the (unquantized) SE algorithm. Moreover, it requires only exponentially sparse communication between the learner and the agents, and therefore uses considerably fewer bits than existing quantization schemes to successfully identify the best arm. We validate the performance improvement offered by ICQ over other quantization methods through numerical experiments.
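For orientation, the unquantized Successive Elimination baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard textbook algorithm only, not of ICQ or ICQ-SE, whose quantization and communication schedule are not specified here. The function name `successive_elimination`, the Bernoulli arms, the Hoeffding-style confidence radius for rewards in [0, 1], and the values of `delta` and `max_rounds` are all assumptions made for the example.

```python
import math
import random


def successive_elimination(pull, num_arms, delta=0.05, max_rounds=100_000):
    """Standard (unquantized) Successive Elimination for rewards in [0, 1].

    `pull(arm)` returns one stochastic reward for the given arm index.
    Returns (estimated best arm, number of rounds used).
    """
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    for t in range(1, max_rounds + 1):
        # Sample every arm that is still active once per round.
        for arm in active:
            sums[arm] += pull(arm)
        # Hoeffding-style radius with a union bound over arms and rounds
        # (an assumed, commonly used choice).
        radius = math.sqrt(math.log(4 * num_arms * t * t / delta) / (2 * t))
        means = {arm: sums[arm] / t for arm in active}
        best_mean = max(means.values())
        # Keep an arm only if its upper bound reaches the best lower bound.
        active = [a for a in active if means[a] + radius >= best_mean - radius]
        if len(active) == 1:
            return active[0], t
    # Sampling budget exhausted: fall back to the best empirical arm.
    return max(active, key=lambda a: sums[a]), max_rounds


# Hypothetical three-armed Bernoulli instance with a fixed seed.
rng = random.Random(0)
true_means = [0.2, 0.5, 0.9]
best, rounds = successive_elimination(
    lambda arm: 1.0 if rng.random() < true_means[arm] else 0.0,
    num_arms=len(true_means),
)
print(best, rounds)
```

In this baseline the learner sees every reward at full precision in every round. In the setting of the abstract, the agents would instead send quantized information over the bit-constrained channel, and far less often.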