Markov games model interactions among multiple players in a stochastic, dynamic environment. Each player in a Markov game maximizes its expected total discounted reward, which depends upon the policies of the other players. We formulate a class of Markov games, termed affine Markov games, where an affine reward function couples the players' actions. We introduce a novel solution concept, the soft-Bellman equilibrium, where each player is boundedly rational and chooses a soft-Bellman policy rather than a purely rational policy as in the well-known Nash equilibrium concept. We provide conditions for the existence and uniqueness of the soft-Bellman equilibrium and propose a nonlinear least squares algorithm to compute such an equilibrium in the forward problem. We then solve the inverse game problem of inferring the players' reward parameters from observed state-action trajectories via a projected gradient algorithm. Experiments in a predator-prey OpenAI Gym environment show that the reward parameters inferred by the proposed algorithm outperform those inferred by a baseline algorithm: they reduce the Kullback-Leibler divergence between the equilibrium policies and observed policies by at least two orders of magnitude.
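To make the notion of a soft-Bellman policy concrete, here is a minimal sketch for the single-agent special case only. It assumes the standard entropy-regularized form, Q(s,a) = r(s,a) + γ Σ_s' P(s'|s,a) V(s'), V(s) = log Σ_a exp Q(s,a), π(a|s) = exp(Q(s,a) − V(s)), and solves it by fixed-point iteration. This is not the nonlinear least squares algorithm proposed for the multi-player equilibrium. The function names, the toy two-state environment, and the uniform state weighting in the KL divergence are all illustrative assumptions.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_bellman_policy(P, r, gamma, iters=500):
    """Soft value iteration for a single-agent discounted MDP (assumed setting).

    P[s][a][s2] is the probability of moving from s to s2 under action a.
    r[s][a] is the reward for taking action a in state s.
    Returns policy[s][a], a softmax over the soft Q-values in each state.
    """
    n_states = len(r)
    V = [0.0] * n_states
    Q = [[0.0] * len(r[s]) for s in range(n_states)]
    for _ in range(iters):
        # Soft-Bellman backup: Q from the current V, then V as log-sum-exp of Q.
        Q = [
            [
                r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(len(r[s]))
            ]
            for s in range(n_states)
        ]
        V = [logsumexp(Q[s]) for s in range(n_states)]
    # pi(a|s) = exp(Q(s,a) - V(s)); each row sums to one by construction.
    return [[math.exp(q - V[s]) for q in Q[s]] for s in range(n_states)]


def mean_kl(p, q):
    """KL(p || q) between two policies, averaged uniformly over states (assumption)."""
    total = 0.0
    for ps, qs in zip(p, q):
        total += sum(pa * math.log(pa / qa) for pa, qa in zip(ps, qs) if pa > 0.0)
    return total / len(p)


# Hypothetical toy problem: action 0 stays in place, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
r = [[1.0, 0.0], [0.0, 0.5]]
policy = soft_bellman_policy(P, r, gamma=0.9)
uniform = [[0.5, 0.5], [0.5, 0.5]]
divergence = mean_kl(policy, uniform)
```

Unlike a greedy policy, the resulting policy assigns nonzero probability to every action, with higher soft Q-values receiving more mass, which is one way to read "boundedly rational". A divergence of the kind computed by `mean_kl` is the sort of quantity the experiments compare, although the exact weighting used there is not stated in this text.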