We study risk-sensitive multi-agent reinforcement learning in general-sum Markov games, where agents optimize the entropic risk measure of rewards with possibly diverse risk preferences. We show that using the regret naively adapted from the existing literature as a performance metric could induce policies with equilibrium bias that favor the most risk-sensitive agents and overlook the others. To address this deficiency of the naive regret, we propose a novel notion of regret, which we call risk-balanced regret, and show through a lower bound that it overcomes the issue of equilibrium bias. Furthermore, we develop a self-play algorithm for learning Nash, correlated, and coarse correlated equilibria in risk-sensitive Markov games. We prove that the proposed algorithm attains near-optimal regret guarantees with respect to the risk-balanced regret.
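For reference, the entropic risk measure mentioned above is the standard one from risk-sensitive control; a minimal LaTeX sketch under one common convention, where the per-agent risk parameter $\beta_i$ is notation assumed here rather than taken from the paper:

% Entropic risk measure of a random cumulative reward X for agent i,
% with assumed risk parameter \beta_i \neq 0 (under this convention,
% \beta_i > 0 is risk-seeking and \beta_i < 0 is risk-averse when
% rewards are maximized); the risk-neutral expectation is recovered
% in the limit \beta_i -> 0.
\[
  \rho_{\beta_i}(X) \;=\; \frac{1}{\beta_i} \log \mathbb{E}\!\left[ e^{\beta_i X} \right],
  \qquad
  \lim_{\beta_i \to 0} \rho_{\beta_i}(X) \;=\; \mathbb{E}[X].
\]

Diverse risk preferences then simply correspond to agents holding different values of $\beta_i$.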