We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy explaining the observed behaviour via the maximum causal entropy principle. We formulate the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, treating two reward classes within a unified occupation-measure framework. For finite-dimensional linear rewards, we give a convex dual reformulation with an explicit log-partition objective, and prove smoothness and curvature properties justifying constant-step-size gradient descent. For infinite-dimensional rewards in a reproducing kernel Hilbert space (RKHS), we develop a Lagrangian relaxation whose inner-maximising policy is characterised by a soft Bellman equation. The main obstacle is the absence of a discount-factor contraction. We resolve this by introducing a minorisation-based sub-stochastic kernel that yields a strict contraction of the soft Bellman operator. We establish Fréchet differentiability and Lipschitz smoothness of the log-likelihood score, leading to a gradient ascent algorithm with convergence guarantees. Two numerical examples, a malware-spread MFG and an RKHS-based consumer-choice model, show that the recovered policies closely match expert behaviour.
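To illustrate the minorisation idea in the simplest setting, the following is a minimal sketch and not the paper's algorithm. It assumes a finite state and action space, a mean-field term frozen at a fixed value (so the transition kernel `P` and reward `r` depend only on state and action), and a minorisation condition `P(y|x,a) >= eps * nu(y)` for a probability vector `nu`. All names (`soft_bellman_fixed_point`, `eps`, `nu`, the random test instance) are hypothetical. Subtracting `eps * nu` from `P` leaves a sub-stochastic kernel of total mass `1 - eps`, so the soft (log-sum-exp) Bellman operator becomes a sup-norm contraction with modulus `1 - eps`, and the average reward can be read off as `eps * <nu, h>`.

```python
import numpy as np

def soft_bellman_fixed_point(r, P, eps, nu, tol=1e-10, max_iter=10_000):
    """Iterate the soft Bellman operator built from a sub-stochastic kernel.

    r  : (S, A) rewards (mean-field term assumed fixed and absorbed into r).
    P  : (S, A, S) transition kernel, assumed to satisfy P[x, a, :] >= eps * nu.
    nu : (S,) probability vector used in the minorisation condition.
    """
    Q = P - eps * nu                      # sub-stochastic: rows sum to 1 - eps
    assert Q.min() >= -1e-12, "minorisation condition violated"
    h = np.zeros(r.shape[0])
    for _ in range(max_iter):
        z = r + Q @ h                     # (S, A) soft action values
        m = z.max(axis=1)
        h_new = m + np.log(np.exp(z - m[:, None]).sum(axis=1))  # stable log-sum-exp
        done = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if done:
            break
    z = r + Q @ h
    policy = np.exp(z - z.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)   # softmax (Gibbs) policy
    rho = eps * float(nu @ h)                     # average soft reward
    return h, rho, policy

# Hypothetical random instance that satisfies the minorisation condition by construction.
rng = np.random.default_rng(0)
S, A, eps = 4, 3, 0.2
nu = np.full(S, 1.0 / S)
base = rng.random((S, A, S))
base /= base.sum(axis=2, keepdims=True)
P = eps * nu + (1.0 - eps) * base
r = rng.normal(size=(S, A))

h, rho, policy = soft_bellman_fixed_point(r, P, eps, nu)

# Residual of the average-reward soft Bellman equation under the original kernel P:
# h(x) + rho = log sum_a exp(r(x, a) + sum_y P(y | x, a) h(y)).
residual = np.max(np.abs(h + rho - np.log(np.exp(r + P @ h).sum(axis=1))))
```

In this sketch the fixed point of the contracted operator solves the average-reward soft Bellman equation for the original kernel, because replacing `Q` by `P` shifts every soft action value by the same constant `eps * <nu, h>`. The same shift cancels in the softmax, so the policy is unchanged.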