Learning in Stackelberg Markov Games

Designing socially optimal policies in multi-agent environments is a fundamental challenge in both economics and artificial intelligence. This paper studies a general framework for learning Stackelberg equilibria in dynamic and uncertain environments, where a single leader interacts with a population of adaptive followers. Motivated by pressing real-world problems such as equitable electricity tariff design for consumers with distributed energy resources (e.g., rooftop solar and energy storage), we formalize a class of Stackelberg Markov games and establish the existence and uniqueness of stationary Stackelberg equilibria under mild continuity and monotonicity conditions. We then extend the framework to a continuum of agents via a mean-field approximation, yielding a tractable Stackelberg-Mean Field Equilibrium (S-MFE) formulation. To address the computational intractability of exact best-response dynamics, we introduce a softmax-based approximation and rigorously bound its error relative to the true Stackelberg equilibrium. Our approach enables scalable and stable learning through policy iteration without requiring full knowledge of follower objectives. We validate the framework on an energy market simulation in which a public utility or state utility commission sets time-varying rates for a heterogeneous population of prosumers. The results show that learned policies can simultaneously achieve economic efficiency, equity across income groups, and stability in energy systems. More broadly, this work illustrates how game-theoretic learning frameworks can support data-driven policy design in large-scale strategic environments such as energy markets.

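The softmax-based approximation of the follower best response mentioned in the abstract can be illustrated with a minimal sketch. It assumes a finite follower action set with known action values, which the abstract does not specify; the function names and the numbers are hypothetical. The bound checked here, `temperature * log(number of actions)`, is the standard entropy bound for Boltzmann policies and is not necessarily the bound derived in the paper.

```python
import math


def softmax_response(q_values, temperature):
    """Boltzmann (softmax) distribution over actions given their values."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(q_values)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def suboptimality_gap(q_values, temperature):
    """Value lost by playing the softmax response instead of the exact argmax."""
    probs = softmax_response(q_values, temperature)
    expected = sum(p * q for p, q in zip(probs, q_values))
    return max(q_values) - expected


# Hypothetical follower action values (illustrative numbers only).
q = [1.0, 0.5, -0.2, 0.9]
for tau in (1.0, 0.1, 0.01):
    gap = suboptimality_gap(q, tau)
    bound = tau * math.log(len(q))  # standard entropy bound on the gap
    print(f"tau={tau}: gap={gap:.6f}, bound={bound:.6f}")
```

Lowering the temperature moves the softmax response toward the exact best response and shrinks the gap, at the cost of a less smooth response map. That trade-off is presumably what makes a smoothed response attractive for stable iterative learning.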