In this paper, we study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. Beyond such partial biased feedback, we also face two practical challenges: communication cost and computation complexity. To tackle these challenges, we design a new \emph{distributed phase-then-batch-based elimination (\texttt{DPBE})} algorithm, which samples users in phases to collect feedback with reduced bias and employs \emph{maximum variance reduction} to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that \texttt{DPBE} achieves a sublinear regret of $\tilde{O}(T^{1-\alpha/2}+\sqrt{\gamma_T T})$, where $\alpha\in (0,1)$ is a tunable user-sampling parameter. Moreover, \texttt{DPBE} can significantly reduce both communication cost and computation complexity in distributed kernelized bandits, compared to variants of state-of-the-art algorithms originally developed for standard kernelized bandits. Furthermore, by incorporating various \emph{differential privacy} models (including the central, local, and shuffle models), we generalize \texttt{DPBE} to provide privacy guarantees for users participating in the distributed learning process. Finally, we conduct extensive simulations to validate our theoretical results and evaluate the empirical performance.