We study federated stochastic multi-armed contextual bandits with unknown contexts, in which M agents face different bandits and collaborate to learn. The communication model consists of a central server with which the agents periodically share their estimates, so that they learn to choose optimal actions and minimize the total regret. We assume that the exact contexts are not observable and that each agent observes only a distribution over the contexts. Such a situation arises, for instance, when the context itself is a noisy measurement or the output of a prediction mechanism. Our goal is to develop a distributed, federated algorithm that facilitates collaborative learning among the agents so that they select a sequence of optimal actions and maximize the cumulative reward. By performing a feature vector transformation, we propose an elimination-based algorithm and prove a regret bound for linearly parametrized reward functions. Finally, we validate the performance of our algorithm and compare it with a baseline approach using numerical simulations on synthetic data and on the real-world MovieLens dataset.
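The abstract does not spell out the feature vector transformation. One plausible reading, stated here purely as an assumption, is that each action's feature vector is replaced by its expectation under the observed context distribution, so that a linear reward model can be applied without knowing the exact context. The sketch below illustrates that idea for a finite set of contexts; the names `expected_features`, `best_action`, `phi`, `weights`, and `theta` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def expected_features(
    phi: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> List[List[float]]:
    """Average per-context feature vectors under a context distribution.

    phi[c][a] is the feature vector of action a under context c, and
    weights[c] is the probability of context c (assumed to sum to 1).
    Returns psi, where psi[a] is the expected feature vector of action a.
    """
    n_actions = len(phi[0])
    dim = len(phi[0][0])
    psi = [[0.0] * dim for _ in range(n_actions)]
    for prob, per_context in zip(weights, phi):
        for a, vec in enumerate(per_context):
            for j, x in enumerate(vec):
                psi[a][j] += prob * x
    return psi


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def best_action(psi: Sequence[Sequence[float]], theta: Sequence[float]) -> int:
    """Index of the action with the largest expected linear reward."""
    return max(range(len(psi)), key=lambda a: dot(psi[a], theta))


# Toy instance (illustrative values only): 2 contexts, 2 actions, dimension 2.
phi = [
    [[1.0, 0.0], [0.0, 1.0]],  # features of actions 0 and 1 under context 0
    [[0.0, 1.0], [1.0, 0.0]],  # features of actions 0 and 1 under context 1
]
weights = [0.75, 0.25]         # observed distribution over the two contexts
theta = [1.0, 0.0]             # hypothetical reward parameter

psi = expected_features(phi, weights)
chosen = best_action(psi, theta)
```

In this toy instance the agent never sees which context occurred; it scores each action using only the averaged features, which is the quantity a linear estimator could be fit against under the stated assumption.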