We address two-player general-sum stochastic Stackelberg games (SSGs), in which the leader optimizes its policy against a best-response follower, whose policy maximizes its own reward given the leader's policy. Existing policy gradient and value iteration approaches for SSGs do not guarantee monotone improvement of the leader's policy under the best-response follower. Consequently, their performance is not guaranteed when their limits are not stationary Stackelberg equilibria (SSEs), which do not necessarily exist. In this paper, we derive a policy improvement theorem for SSGs under the best-response follower and propose a novel policy iteration algorithm that guarantees monotone improvement of the leader's performance. Additionally, we introduce Pareto optimality as an extended notion of optimality beyond the SSE and prove that our method converges to the Pareto front when the leader is myopic.