We study online learning in Bayesian Stackelberg games, where a leader repeatedly interacts with a follower whose unknown private type is drawn independently at each round from an unknown probability distribution. The goal is to design algorithms that minimize the leader's regret with respect to always playing an optimal commitment computed with knowledge of the game. We consider, for the first time to the best of our knowledge, the most realistic case in which the leader knows nothing about the follower's types, i.e., their possible payoffs. This poses considerably greater challenges than the commonly studied setting in which the payoffs of the follower's types are known. First, we prove a strong negative result: no-regret learning is unattainable under action feedback, i.e., when the leader only observes the follower's best response at the end of each round. Thus, we focus on the easier type feedback model, in which the follower's type is also revealed. In this setting, we propose a no-regret algorithm achieving regret $\widetilde{O}(\sqrt{T})$, ignoring the dependence on other parameters.
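The regret benchmark described above can be sketched as follows; the notation here is a standard convention and an assumption on our part, not necessarily the paper's own:

```latex
% Leader's regret against the best fixed commitment in hindsight.
% x^\star : optimal commitment computed with full knowledge of the game,
% x_t     : leader's commitment at round t,
% \theta_t \sim \mu : follower type drawn i.i.d. from the unknown distribution \mu,
% b_\theta(x) : best response of a type-\theta follower to commitment x,
% u_\ell  : leader's utility.
R_T \;:=\; T \cdot \mathbb{E}_{\theta \sim \mu}\!\left[ u_\ell\big(x^\star, b_\theta(x^\star)\big) \right]
\;-\; \sum_{t=1}^{T} u_\ell\big(x_t, b_{\theta_t}(x_t)\big)
```

A no-regret algorithm guarantees $R_T = o(T)$, and the $\widetilde{O}(\sqrt{T})$ bound stated above makes this sublinearity quantitative under type feedback.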