Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable -- a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel single-timescale Actor-Critic algorithm characterized by a faster actor and a slower critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.