In general-sum stochastic games, a stationary Stackelberg equilibrium (SSE), in which the leader maximizes its return from every initial state while the follower plays a best response to the leader's policy, does not always exist. Existing methods for computing SSEs require strong assumptions to guarantee convergence and to guarantee that the limit coincides with an SSE. Moreover, our analysis suggests that when the fixed points of these methods are not SSEs, the performance attained at those fixed points is not reasonable. We therefore introduce Pareto-optimality as a reasonable alternative to SSEs. We derive a policy improvement theorem for stochastic games with a best-response follower and, based on it, propose an iterative algorithm for finding Pareto-optimal policies. We prove that the proposed approach improves monotonically and converges, and that it converges to an SSE in a special case.
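To make the best-response setting concrete, the following is a minimal sketch, not the algorithm proposed in this work. It relies only on the standard fact that once the leader commits to a stationary policy, the follower faces an ordinary Markov decision process, which can be solved by value iteration. The two-state game, the names `q_values`, `follower_best_response`, and `leader_return`, the discount factor, and the tie-breaking rule (lowest action index) are all illustrative assumptions.

```python
# Illustrative sketch only: a hypothetical two-state, two-action game.
# P[s][a][b] is the next-state distribution for leader action a and
# follower action b in state s; rewards are indexed the same way.

def q_values(P, r, leader_policy, V, s, gamma):
    """Action values over follower actions in state s, averaging over
    the leader's (possibly stochastic) stationary policy."""
    q = []
    for b in range(len(P[s][0])):
        total = 0.0
        for a, prob_a in enumerate(leader_policy[s]):
            nxt = sum(p * V[t] for t, p in enumerate(P[s][a][b]))
            total += prob_a * (r[s][a][b] + gamma * nxt)
        q.append(total)
    return q

def follower_best_response(P, r_follower, leader_policy, gamma, iters=200):
    """Value iteration on the MDP induced by a fixed leader policy.
    Ties are broken by the lowest action index (an assumption)."""
    states = range(len(P))
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [max(q_values(P, r_follower, leader_policy, V, s, gamma))
             for s in states]
    response = []
    for s in states:
        q = q_values(P, r_follower, leader_policy, V, s, gamma)
        response.append(q.index(max(q)))
    return response

def leader_return(P, r_leader, leader_policy, response, gamma, iters=200):
    """Iterative policy evaluation of the leader's discounted return
    when the follower plays the deterministic policy `response`."""
    states = range(len(P))
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [q_values(P, r_leader, leader_policy, V, s, gamma)[response[s]]
             for s in states]
    return V

# State 0: follower action 0 stays in state 0, action 1 moves to state 1.
# State 1 is absorbing and pays nothing to either player.
P = [
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],
    [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]],
]
R_FOLLOWER = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
R_LEADER = [[[2.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
GAMMA = 0.5

# Two deterministic leader commitments that differ only in state 0.
COMMIT_A0 = [[1.0, 0.0], [1.0, 0.0]]
COMMIT_A1 = [[0.0, 1.0], [1.0, 0.0]]

if __name__ == "__main__":
    for name, policy in (("a=0", COMMIT_A0), ("a=1", COMMIT_A1)):
        response = follower_best_response(P, R_FOLLOWER, policy, GAMMA)
        print(name, response, leader_return(P, R_LEADER, policy, response, GAMMA))
```

In this toy game, committing to leader action 0 in state 0 looks attractive in isolation (reward 2 if the follower stays), but the follower's best response is to leave for the absorbing state, so the leader's return from state 0 is 0. Committing to action 1 keeps the follower in state 0 and yields a discounted return of 2. The sketch only evaluates given leader policies; it does not search for an SSE or a Pareto-optimal policy.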