This paper studies a cooperative multi-agent stochastic multi-armed bandit problem in which agents operate asynchronously (their pull times and rates are unknown, irregular, and heterogeneous) and face the same instance of a K-armed bandit. Agents can share reward information to speed up learning, at an additional communication cost. We propose ODC, an on-demand communication protocol that tailors the communication of each pair of agents to their empirical pull times. ODC is efficient when agents' pull times are highly heterogeneous, and its communication complexity depends on those empirical pull times. ODC is a generic protocol that can be integrated into most cooperative bandit algorithms without degrading their performance. We then incorporate ODC into natural extensions of the UCB and AAE algorithms, yielding two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.
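As a rough illustration of why shared reward information helps, the sketch below computes a standard UCB1 index from an agent's own samples pooled with samples received from other agents. It is not the ODC protocol or the paper's cooperative UCB extension, neither of which is specified in this abstract; the function names `ucb_index` and `pooled_choice`, the `(total_reward, pulls)` layout, and the UCB1 bonus constant are all assumptions made for the example.

```python
import math


def ucb_index(total_reward, pulls, t):
    """Standard UCB1 index (assumed form); unpulled arms get +inf."""
    if pulls == 0:
        return math.inf
    return total_reward / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def pooled_choice(local_stats, shared_stats, t):
    """Return the arm with the largest index over local plus shared samples.

    local_stats and shared_stats are hypothetical per-arm lists of
    (total_reward, pulls); t is the agent's current decision round (t >= 1).
    """
    indices = []
    for (local_reward, local_pulls), (shared_reward, shared_pulls) in zip(
        local_stats, shared_stats
    ):
        indices.append(
            ucb_index(local_reward + shared_reward, local_pulls + shared_pulls, t)
        )
    # Ties are broken in favor of the lowest arm number.
    return max(range(len(indices)), key=indices.__getitem__)
```

Pooling raises the pull count of an arm and so shrinks its confidence bonus, which is the sense in which sharing can speed up learning; deciding when such an exchange is worth its communication cost is the question ODC addresses.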