In a multi-follower Bayesian Stackelberg game, a leader commits to a mixed strategy over $L$ actions, to which $n \ge 1$ followers, each holding one of $K$ possible private types and choosing among $A$ actions, best respond. The leader's optimal strategy depends on the distribution of the followers' private types. We study an online learning version of this problem: the leader interacts for $T$ rounds with $n$ followers whose types are sampled from an unknown distribution in every round. The leader's goal is to minimize regret, defined as the difference between the cumulative utility of the optimal strategy and that of the strategies actually chosen. We design learning algorithms for the leader under different feedback settings. Under type feedback, where the leader observes the followers' types after each round, we design algorithms that achieve $O\big(\sqrt{\min(L\log(nKAT),\, nK) \cdot T}\big)$ regret for independent type distributions and $O\big(\sqrt{\min(L\log(nKAT),\, K^n) \cdot T}\big)$ regret for general type distributions. Notably, these bounds do not grow polynomially with $n$. Under action feedback, where the leader only observes the followers' actions, we design algorithms with $O\big(\min\big(\sqrt{n^L K^L A^{2L} L T \log T},\, K^n \sqrt{T} \log T\big)\big)$ regret. We also provide a lower bound of $\Omega\big(\sqrt{\min(L, nK)\, T}\big)$, almost matching the type-feedback upper bounds.
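To make the learning objective concrete, the regret can be written out formally. The following is a minimal sketch under assumed notation (not fixed in the abstract itself): $x_t$ is the leader's mixed strategy in round $t$, $\boldsymbol{\theta}_t$ the followers' type profile sampled in that round, $\Delta_L$ the simplex over the leader's $L$ actions, and $u(x, \boldsymbol{\theta})$ the leader's expected utility when followers with type profile $\boldsymbol{\theta}$ best respond to $x$:
\[
  \mathrm{Reg}(T) \;=\; \max_{x \in \Delta_L} \, \mathbb{E}\!\left[\sum_{t=1}^{T} u\big(x, \boldsymbol{\theta}_t\big)\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} u\big(x_t, \boldsymbol{\theta}_t\big)\right].
\]
Since the type profiles are drawn i.i.d. across rounds, the benchmark term equals $T \cdot \max_{x \in \Delta_L} \mathbb{E}_{\boldsymbol{\theta}}\big[u(x, \boldsymbol{\theta})\big]$, i.e., the cumulative utility of the best fixed mixed strategy against the unknown type distribution.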