We study a new non-stochastic federated multi-armed bandit problem in which multiple agents collaborate via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, a setting we call ``doubly adversarial''. In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight, that is, the arm with the lowest cumulative loss averaged over all agents, which necessitates communication among agents. We provide regret lower bounds for any federated bandit algorithm under two settings: when agents have access to full-information feedback, and when they have access only to bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question posed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets.
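One way to make the objective concrete is to write out the regret against the globally best arm. The notation here is an assumption introduced for illustration and is not taken from the abstract: $\mathcal{V}$ is the set of agents, $K$ the number of arms, $T$ the horizon, $\ell_t^v(i)$ the loss the adversary assigns to arm $i$ for agent $v$ at time $t$, and $i_t^v$ the arm agent $v$ plays at time $t$.

\[
\bar{\ell}_t(i) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \ell_t^v(i),
\qquad
R_T^v = \mathbb{E}\left[ \sum_{t=1}^{T} \bar{\ell}_t\left(i_t^v\right) \right] - \min_{i \in \{1,\dots,K\}} \sum_{t=1}^{T} \bar{\ell}_t(i).
\]

Under this reading, agent $v$ observes only its own losses $\ell_t^v(\cdot)$ yet is evaluated on the agent-averaged loss $\bar{\ell}_t(\cdot)$, which is why the arm that looks best from local feedback alone need not be the globally best arm.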