In this paper, we introduce the maximum causal entropy Inverse Reinforcement Learning (IRL) problem for discrete-time mean-field games (MFGs) under an infinite-horizon discounted-reward optimality criterion. The state space of a typical agent is finite. We begin with a comprehensive review of the maximum entropy IRL problem for deterministic and stochastic Markov decision processes (MDPs) in both finite- and infinite-horizon settings. We then formulate the maximum causal entropy IRL problem for MFGs, which is a non-convex optimization problem with respect to policies. Leveraging the linear programming formulation of MDPs, we recast this IRL problem as a convex optimization problem and establish a gradient descent algorithm that computes the optimal solution, together with its rate of convergence. Finally, we present a new algorithm, based on formulating the MFG problem as a generalized Nash equilibrium problem (GNEP), that computes the mean-field equilibrium (MFE) for the forward RL problem. We use this method to generate data for a numerical example. We note that this novel algorithm is also applicable to general MFE computations.
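To illustrate the linear programming view of MDPs mentioned above, here is a minimal sketch, not the paper's algorithm. It assumes a single finite discounted MDP with a fixed transition kernel, so the mean-field dependence is deliberately left out. The names `occupation_measure`, `flow_residual`, `P`, `pi`, `mu0`, and `beta` are hypothetical and chosen for this example only. The sketch computes the normalized state-action occupation measure of a stationary policy and checks that it satisfies the linear flow constraints. Because these constraints are linear in the occupation measure, optimizing over occupation measures instead of policies yields a convex feasible set.

```python
import numpy as np


def occupation_measure(P, pi, mu0, beta):
    """Normalized discounted occupation measure nu[x, a] of a stationary policy.

    P[x, a, y] : transition probabilities (assumed fixed, no mean-field term)
    pi[x, a]   : stationary randomized policy
    mu0[x]     : initial state distribution
    beta       : discount factor in (0, 1)
    """
    n_states = P.shape[0]
    # State-to-state kernel under pi: P_pi[x, y] = sum_a pi[x, a] * P[x, a, y]
    P_pi = np.einsum("xa,xay->xy", pi, P)
    # State marginal d solves d = (1 - beta) * mu0 + beta * P_pi^T d
    d = (1.0 - beta) * np.linalg.solve(np.eye(n_states) - beta * P_pi.T, mu0)
    return d[:, None] * pi


def flow_residual(nu, P, mu0, beta):
    """Largest violation of the linear flow constraint
    sum_a nu[y, a] = (1 - beta) * mu0[y] + beta * sum_{x, a} nu[x, a] * P[x, a, y].
    """
    lhs = nu.sum(axis=1)
    rhs = (1.0 - beta) * mu0 + beta * np.einsum("xa,xay->y", nu, P)
    return float(np.max(np.abs(lhs - rhs)))


# Small random instance (illustrative data only).
rng = np.random.default_rng(0)
n_states, n_actions, beta = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
mu0 = np.full(n_states, 1.0 / n_states)

nu = occupation_measure(P, pi, mu0, beta)
# The policy is recovered by normalizing nu over actions.
pi_recovered = nu / nu.sum(axis=1, keepdims=True)
```

In this sketch the map from `pi` to `nu` is nonlinear, while the constraints on `nu` are linear. This is the structural reason a policy-space problem can become convex once it is rewritten over occupation measures.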