Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks

We study reinforcement learning (RL) for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure. Specifically, at the outset of the game, the leader announces her policy to the follower and commits to it. The follower observes the leader's policy and, in turn, adopts a quantal response policy by solving an entropy-regularized policy optimization problem induced by the leader's policy. The goal of the leader is to find her optimal policy, which yields the optimal expected total return, by interacting with the follower and learning from data. A key challenge of this problem is that the leader cannot observe the follower's reward and must infer the follower's quantal response model from his actions in response to the leader's policies. We propose sample-efficient algorithms for both the online and offline settings, in the context of function approximation. Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader's decision-making problem, and we show that they achieve sublinear regret upper bounds. Moreover, we quantify the uncertainty of these estimators and leverage it to implement optimistic and pessimistic algorithms for the online and offline settings, respectively. In addition, when specialized to the linear and myopic setting, our algorithms are also computationally efficient. Our theoretical analysis features a novel performance-difference lemma that incorporates the error of the quantal response model and may be of independent interest.
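To make the quantal response model and its maximum likelihood estimation concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: a myopic follower, a follower reward that is linear in a hypothetical feature map, and a known inverse temperature `eta`. In that case the entropy-regularized problem max over nu of E_nu[r] + (1/eta) H(nu) has the softmax solution nu(b) proportional to exp(eta * r(b)), and the leader can fit the reward parameter from observed follower actions alone. The names `quantal_response`, `fit_theta_mle`, `true_theta`, and the synthetic data are illustrative only; plain gradient ascent stands in for whatever optimizer one would actually use.

```python
import math
import random


def quantal_response(rewards, eta):
    """Softmax over follower rewards: maximizer of E_nu[r] + (1/eta) * H(nu)."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(eta * (r - m)) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]


def linear_rewards(phi_t, theta):
    """Assumed linear follower reward <phi(b), theta> for each follower action b."""
    return [sum(p * th for p, th in zip(phi, theta)) for phi in phi_t]


def fit_theta_mle(features, actions, eta, steps=300, lr=0.2):
    """Gradient ascent on the average log-likelihood of observed follower actions.

    features[t][b] is the (hypothetical) feature vector of follower action b
    in round t; actions[t] is the index of the action the follower took.
    """
    dim = len(features[0][0])
    theta = [0.0] * dim
    n = len(actions)
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_t, b_t in zip(features, actions):
            probs = quantal_response(linear_rewards(phi_t, theta), eta)
            for j in range(dim):
                expected = sum(pr * phi[j] for pr, phi in zip(probs, phi_t))
                # d/dtheta log nu(b_t) = eta * (phi(b_t) - E_nu[phi])
                grad[j] += eta * (phi_t[b_t][j] - expected)
        theta = [th + lr * g / n for th, g in zip(theta, grad)]
    return theta


# Synthetic check: the leader never sees rewards, only the follower's actions.
rng = random.Random(0)
true_theta = [1.0, -0.5]  # hidden follower reward parameter (assumption)
eta = 2.0                 # inverse temperature, assumed known
features, actions = [], []
for _ in range(500):
    phi_t = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(3)]
    probs = quantal_response(linear_rewards(phi_t, true_theta), eta)
    features.append(phi_t)
    actions.append(rng.choices(range(3), weights=probs)[0])

theta_hat = fit_theta_mle(features, actions, eta)
print("estimated theta:", theta_hat)
```

The log-likelihood is concave in the parameter under this linear softmax model, so simple gradient ascent suffices here; the sketch does not cover the uncertainty quantification or the optimistic and pessimistic planning steps described in the abstract.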
