Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in multi-agent reinforcement learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable, in stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study risk-averse quantal response equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We show that RQE satisfies strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two-timescale actor-critic algorithm that pairs a fast-timescale actor with a slow-timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments, demonstrating superior convergence properties compared to risk-neutral baselines.
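To make the two-timescale structure concrete, the following is a minimal, self-contained Python sketch of the kind of update the abstract describes, instantiated on a randomly generated two-player general-sum matrix game rather than a full Markov game. Everything specific here is an illustrative assumption, not the paper's construction: the payoff tables `R`, the step-size exponents (0.6 for the fast actor, 0.9 for the slow critic), the quantal-response temperature `tau` standing in for bounded rationality, and the variance-style penalty weighted by `rho` standing in for a proper risk measure.

```python
# Hedged sketch: two-timescale actor-critic on a random two-player
# general-sum matrix game. Stand-in choices throughout; the paper's
# actual risk measures and Markov-game dynamics are not reproduced.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical payoff tables: R[i][a0, a1] is player i's mean reward
# when player 0 plays a0 and player 1 plays a1.
R = [rng.uniform(0.0, 1.0, size=(3, 3)) for _ in range(2)]

tau = 0.5    # quantal-response temperature (bounded rationality), assumed
rho = 0.2    # weight on a crude variance-style risk penalty, assumed
T = 20000    # number of learning iterations

Q = [np.zeros(3), np.zeros(3)]        # critics: per-action value estimates
logits = [np.zeros(3), np.zeros(3)]   # actors: mixed strategies as logits

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for t in range(1, T + 1):
    # Fast actor / slow critic: the actor's step size decays more
    # slowly (t^-0.6) than the critic's (t^-0.9), so the actor moves
    # on the faster timescale, as the abstract describes.
    beta_actor = 1.0 / t ** 0.6
    alpha_critic = 1.0 / t ** 0.9

    # Sample one joint action from the current mixed strategies.
    pi = [softmax(logits[0]), softmax(logits[1])]
    a = [rng.choice(3, p=pi[0]), rng.choice(3, p=pi[1])]

    # Noisy reward observations for each player.
    r = [R[0][a[0], a[1]] + 0.1 * rng.standard_normal(),
         R[1][a[0], a[1]] + 0.1 * rng.standard_normal()]

    for i in range(2):
        # Slow critic: running average of the played action's payoff
        # against the opponent's (slowly drifting) strategy.
        Q[i][a[i]] += alpha_critic * (r[i] - Q[i][a[i]])
        # Illustrative risk adjustment: penalize value dispersion.
        Q_risk = Q[i] - rho * (Q[i] - Q[i].mean()) ** 2
        # Fast actor: track the quantal (softmax) response to the
        # risk-adjusted critic, i.e. an entropy-regularized best reply.
        logits[i] += beta_actor * (Q_risk / tau - logits[i])

for i in range(2):
    print(f"player {i} strategy:", np.round(softmax(logits[i]), 3))
```

Because the actor logits chase `Q_risk / tau` at the faster timescale, each player's strategy stays close to a quantal response to its current risk-adjusted values while the critics drift slowly underneath, which is one simple way to realize the fast-actor / slow-critic separation in code.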