Multi-armed bandit algorithms address sequential decision-making problems in which learning takes place through interaction with the environment. In this work, we model a distributed optimization problem as a multi-agent kernelized multi-armed bandit problem with heterogeneous rewards. In this setup, the agents collaboratively aim to maximize a global objective function that is the average of their local objective functions. Each agent can access only bandit feedback (noisy rewards) from its own unknown local function, which has a small norm in a reproducing kernel Hilbert space (RKHS). We present a fully decentralized algorithm, Multi-agent IGP-UCB (MA-IGP-UCB), which achieves a sub-linear regret bound for popular classes of kernels while preserving privacy. It does not require the agents to share their actions, rewards, or estimates of their local functions. In the proposed approach, the agents sample their individual local functions in a way that benefits the whole network, using a running consensus to estimate the upper confidence bound on the global function. Furthermore, we propose an extension, the Multi-agent Delayed IGP-UCB (MAD-IGP-UCB) algorithm, which reduces the dependence of the regret bound on the number of agents in the network. It improves performance by introducing a delay in the estimation update step, at the cost of more communication.
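To illustrate the role a running consensus can play in estimating a network-wide average, the following is a minimal sketch of a generic dynamic average-consensus tracker. It is not the MA-IGP-UCB update itself. The ring topology, the uniform 1/3 mixing weights, the function names `ring_weights` and `track_average`, and the synthetic signals are all assumptions made for illustration; the local signals stand in for per-agent scores (for example, local upper confidence values) over a small set of candidate actions.

```python
import numpy as np


def ring_weights(n_agents: int) -> np.ndarray:
    """Doubly stochastic mixing matrix for a ring network (assumed topology).

    Each agent averages its own state with those of its two ring neighbours.
    """
    W = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        for j in (i - 1, i, i + 1):
            W[i, j % n_agents] += 1.0 / 3.0
    return W


def track_average(local_signals: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Track the network average of time-varying local signals.

    local_signals has shape (T, N, D): at step t, agent i holds the
    D-dimensional vector local_signals[t, i]. Agents exchange only the
    mixed state z with their neighbours. Returns estimates of shape (T, N, D).
    """
    z = local_signals[0].copy()
    estimates = [z.copy()]
    for t in range(1, local_signals.shape[0]):
        # Mix with neighbours, then inject the change in the local signal.
        z = W @ z + local_signals[t] - local_signals[t - 1]
        estimates.append(z.copy())
    return np.stack(estimates)


# Hypothetical setup: 5 agents, 4 candidate actions, 80 time steps.
rng = np.random.default_rng(0)
n_agents, n_actions, horizon = 5, 4, 80
base = rng.normal(size=(n_agents, n_actions))  # heterogeneous local scores
signals = np.repeat(base[None, :, :], horizon, axis=0)
# Local scores fluctuate for the first 20 steps, then settle.
signals[:20] += 0.5 * rng.normal(size=(20, n_agents, n_actions))

W = ring_weights(n_agents)
estimates = track_average(signals, W)

global_scores = signals[-1].mean(axis=0)       # true network average
chosen = estimates[-1].argmax(axis=1)          # each agent's greedy choice
print("global argmax:", int(global_scores.argmax()))
print("per-agent argmax:", chosen.tolist())
```

Because the mixing matrix is doubly stochastic, the average of the agents' estimates equals the average of their local signals at every step, and once the local signals settle, each agent's estimate converges to that common average. In this sketch every agent therefore ends up ranking the candidate actions by the global average even though it only observes its own local signal and its neighbours' mixed states.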