We study the problem of online learning in Stackelberg games with side information, played between a leader and a sequence of followers. In each round, the leader observes contextual information and commits to a mixed strategy, after which the follower best responds. We provide learning algorithms for the leader that achieve $O(T^{1/2})$ regret under bandit feedback, improving on the previous best-known rate of $O(T^{2/3})$. Our algorithms rely on a reduction to linear contextual bandits in the utility space: in each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader's mixed strategy. We extend our algorithms to the setting in which the leader's utility function is unknown, and also apply them to the problems of bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Finally, we observe in numerical simulations that our algorithms empirically outperform prior approaches.
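To make the reduction concrete, the following is a minimal Python sketch of the round structure, not the paper's implementation. It rests on assumptions that go beyond the abstract: the utility space is discretized into a finite set of candidate vectors, the contextual bandit is a LinUCB-style index over the feature map $\phi(z, u) = [z; u]$, and `invert()` is a hypothetical placeholder for the inversion step from a recommended utility vector to a leader mixed strategy. The dimensions and the simulated bandit feedback are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 3, 4, 200          # context dimension, # leader actions, horizon (illustrative)
lam, alpha = 1.0, 1.0        # ridge regularization, exploration-bonus weight

# Finite grid of candidate utility vectors, standing in for the utility space.
candidates = [rng.random(m) for _ in range(25)]

A = lam * np.eye(d + m)      # LinUCB design matrix over features [z; u]
bvec = np.zeros(d + m)

def invert(u):
    """Hypothetical placeholder for the inversion step: return a mixed
    strategy over the leader's m actions intended to induce the target
    utility vector u. Here we simply normalize u as a stand-in."""
    w = np.clip(u, 1e-9, None)
    return w / w.sum()

for t in range(T):
    z = rng.random(d)                        # observed side information (context)
    theta = np.linalg.solve(A, bvec)         # ridge-regression estimate

    def ucb(u):
        # Optimistic LinUCB index for candidate utility vector u.
        phi = np.concatenate([z, u])
        return phi @ theta + alpha * np.sqrt(phi @ np.linalg.solve(A, phi))

    u_t = max(candidates, key=ucb)           # bandit recommends a utility vector
    x_t = invert(u_t)                        # leader commits to the inverted mixed strategy

    # Bandit feedback: only the realized leader utility is observed
    # (simulated here with random follower payoffs).
    reward = float(x_t @ rng.random(m))

    phi_t = np.concatenate([z, u_t])         # standard LinUCB updates
    A += np.outer(phi_t, phi_t)
    bvec += reward * phi_t
```

The design choice mirrored here is that exploration happens in the utility space rather than directly over mixed strategies; the inversion step then translates the bandit's recommendation into a commitment for the leader.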