We introduce the problem of conservative distributed multi-task learning in stochastic linear contextual bandits with heterogeneous agents. This extends conservative linear bandits to a distributed setting in which M agents tackle different but related tasks while adhering to stage-wise performance constraints. The exact context is unknown; only a context distribution is available to the agents, as in many practical applications that rely on a prediction mechanism to infer the context, such as stock market prediction and weather forecasting. We propose a distributed upper confidence bound (UCB) algorithm, DiSC-UCB. In each round, the algorithm constructs a pruned action set to ensure that the constraints are met. Agents also share their estimates through a central server at well-structured synchronization steps. We prove regret and communication bounds for the algorithm. We then extend the problem to a setting in which the agents do not know the baseline reward. For this setting, we provide a modified algorithm, DiSC-UCB2, and show that it achieves the same regret and communication bounds. We empirically validate the performance of our algorithm on synthetic data and real-world Movielens-100K data.
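To make the idea of a pruned action set concrete, the following is a minimal single-agent sketch, not the DiSC-UCB algorithm itself, whose exact construction is not given in this abstract. It assumes a common form of the stage-wise constraint from the conservative bandit literature: an action is kept only if its lower confidence bound is at least (1 - alpha) times the baseline reward. The function name `pruned_ucb_choice` and all parameters (`features`, `theta_hat`, `V`, `beta`, `baseline_reward`, `alpha`) are hypothetical names introduced for illustration.

```python
import numpy as np


def pruned_ucb_choice(features, theta_hat, V, beta, baseline_reward, alpha):
    """Pick the max-UCB action among actions that look safe (illustrative only).

    features: (K, d) array, one feature vector per action (assumed here to be
        the expected features under the known context distribution).
    theta_hat: (d,) current parameter estimate.
    V: (d, d) regularized Gram matrix.
    beta: confidence radius (assumed given).
    Returns the index of the chosen action, or None if no action passes
    the assumed constraint (a caller might then play the baseline action).
    """
    V_inv = np.linalg.inv(V)
    means = features @ theta_hat
    # Confidence width per action: beta * sqrt(x^T V^{-1} x).
    widths = beta * np.sqrt(np.einsum("ij,jk,ik->i", features, V_inv, features))
    lcb = means - widths
    ucb = means + widths
    # Pruned set: actions whose pessimistic reward meets the assumed constraint.
    safe = np.flatnonzero(lcb >= (1.0 - alpha) * baseline_reward)
    if safe.size == 0:
        return None
    # Optimism within the pruned set.
    return int(safe[np.argmax(ucb[safe])])
```

With a well-conditioned Gram matrix the confidence widths are small and more actions survive pruning; early on, when the widths are large, the pruned set can be empty, which is why a fallback to the baseline action is a natural design choice in this sketch.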