Reward Teaching for Federated Multi-armed Bandits

Most existing federated multi-armed bandit (FMAB) designs presume that clients will implement the specified design to collaborate with the server. In practice, however, it may not be possible to modify the clients' existing protocols. To address this challenge, this work focuses on clients who always maximize their individual cumulative rewards, and introduces the novel idea of "reward teaching", where the server guides the clients towards global optimality through implicit local reward adjustments. Under this framework, the server faces two tightly coupled tasks, bandit learning and target teaching, whose combination is non-trivial. A phased approach, called Teaching-After-Learning (TAL), is first designed to encourage and discourage clients' explorations separately. A general performance analysis of TAL is established for client strategies that satisfy certain mild requirements. With novel technical approaches developed to analyze the warm-start behaviors of bandit algorithms, specific guarantees for TAL with clients running UCB or epsilon-greedy strategies are then obtained. These results demonstrate that TAL achieves logarithmic regret while incurring only a logarithmic adjustment cost, which is order-optimal w.r.t. a natural lower bound. As a further extension, the Teaching-While-Learning (TWL) algorithm is developed with the idea of successive arm elimination to break the non-adaptive phase separation in TAL. Rigorous analyses demonstrate that when facing clients running UCB1, TWL outperforms TAL in its dependence on the sub-optimality gaps, thanks to its adaptive design. Experimental results demonstrate the effectiveness and generality of the proposed algorithms.
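To make the reward-teaching idea concrete, the following is a minimal sketch of the teaching half only, and not the TAL or TWL algorithm itself. It assumes the server already knows the target arm (so the bandit-learning task is skipped), that local rewards are Bernoulli, and that the server applies a fixed penalty to every non-target pull. The names `UCB1Client`, `teach`, and `penalty` are hypothetical and chosen for illustration.

```python
import math
import random


class UCB1Client:
    """A self-interested client running standard UCB1 on the rewards it is shown."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.means = [0.0] * num_arms
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Pull every arm once before relying on confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            m + math.sqrt(2.0 * math.log(self.t) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the empirical mean for the pulled arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def teach(local_means, target_arm, horizon, penalty=1.0, seed=0):
    """Simulate a server steering one UCB1 client towards target_arm.

    Assumption (not from the paper): the server subtracts a fixed penalty
    from the reward shown to the client whenever a non-target arm is pulled.
    Returns the per-arm pull counts and the total absolute adjustment.
    """
    rng = random.Random(seed)
    client = UCB1Client(len(local_means))
    total_cost = 0.0
    for _ in range(horizon):
        arm = client.select_arm()
        reward = 1.0 if rng.random() < local_means[arm] else 0.0
        adjustment = 0.0 if arm == target_arm else -penalty
        total_cost += abs(adjustment)
        client.update(arm, reward + adjustment)
    return client.counts, total_cost


if __name__ == "__main__":
    # Arm 0 is locally best, but the server wants the client to settle on arm 1.
    counts, cost = teach([0.9, 0.5, 0.4], target_arm=1, horizon=2000)
    print("pull counts:", counts)
    print("total adjustment cost:", cost)
```

In this sketch the adjustment cost grows only with the number of non-target pulls, which is the quantity the client's own exploration rule keeps small once the target arm looks best. A fixed penalty is a deliberately crude choice; it only illustrates why the server does not need to modify the client's protocol.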
