Recent work on neural contextual bandits has achieved compelling performance by leveraging the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. This paper therefore introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: $\text{UCB}^{a}$ allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while $\text{UCB}^{b}$ uses an NN with aggregated parameters for reward prediction, in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis admits an interesting interpretation: it emphasizes $\text{UCB}^{a}$ initially to accelerate exploration and relies more on $\text{UCB}^{b}$ later, once enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.
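The weighted combination of the two UCBs described above can be illustrated with a minimal sketch. The abstract does not state the exact weight required by the theoretical analysis, so the schedule `hypothetical_weight` below is purely an assumption chosen to mimic the stated behavior (emphasis on $\text{UCB}^{a}$ early, on $\text{UCB}^{b}$ later). The function names, the convex-combination form, and the per-arm score lists are likewise hypothetical placeholders, not the authors' implementation.

```python
def combined_ucb(ucb_a, ucb_b, alpha):
    """Convex combination (1 - alpha) * UCB^a + alpha * UCB^b.

    Assumption: the combination is convex with a single weight alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * ucb_a + alpha * ucb_b


def hypothetical_weight(t, horizon):
    """Illustrative schedule only (not the weight from the paper's analysis).

    Grows linearly from 0 to 1, so UCB^a dominates early and UCB^b later.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return min(1.0, max(0.0, t / horizon))


def select_arm(ucb_a_scores, ucb_b_scores, alpha):
    """Return the index of the arm with the largest combined score."""
    scores = [
        combined_ucb(a, b, alpha)
        for a, b in zip(ucb_a_scores, ucb_b_scores)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `alpha = 0` the selection depends only on the $\text{UCB}^{a}$ scores, and with `alpha = 1` only on the $\text{UCB}^{b}$ scores; intermediate values interpolate between the two.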