Classical multi-agent reinforcement learning (MARL) assumes that agents are risk-neutral and fully objective. However, in settings where agents need to consider or model human economic or social preferences, a notion of risk must be incorporated into the RL optimization problem. This matters even more in MARL, where other human or non-human agents are involved, possibly with their own risk-sensitive policies. In this work, we consider risk-sensitive, non-cooperative MARL with cumulative prospect theory (CPT), a non-convex risk measure and a generalization of coherent measures of risk. CPT can explain loss aversion in humans as well as their tendency to overestimate small probabilities and underestimate large ones. We propose a distributed sampling-based actor-critic (AC) algorithm with CPT risk for network aggregative Markov games (NAMGs), which we call Distributed Nested CPT-AC. Under a set of assumptions, we prove that the algorithm converges to a subjective notion of Markov perfect Nash equilibrium in NAMGs. The experimental results show that the subjective CPT policies obtained by our algorithm can differ from the risk-neutral ones, and that agents with higher loss aversion are more inclined to socially isolate themselves in an NAMG.
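To make the two behavioral effects attributed to CPT concrete, the following is a minimal sketch of a sample-based CPT value estimate. It is not the Distributed Nested CPT-AC algorithm. It assumes the Tversky-Kahneman (1992) power utility and inverse-S probability weighting with their commonly cited parameter estimates; the function names `tk_weight` and `cpt_value`, the reference point, and all default parameters are illustrative assumptions rather than choices stated in the abstract.

```python
import numpy as np


def tk_weight(p, gamma):
    """Inverse-S probability weighting (Tversky-Kahneman form, assumed here)."""
    p = np.asarray(p, dtype=float)
    return p**gamma / (p**gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


def cpt_value(samples, reference=0.0, alpha=0.88, lam=2.25,
              gamma_gain=0.61, gamma_loss=0.69):
    """Estimate the CPT value of a random payoff from i.i.d. samples.

    All defaults are assumptions for illustration:
    alpha      - curvature of the power utility,
    lam        - loss-aversion coefficient (lam > 1 penalizes losses more),
    gamma_*    - weighting-function parameters for gains and losses.
    """
    # Outcomes relative to the reference point, sorted in ascending order.
    x = np.sort(np.asarray(samples, dtype=float) - reference)
    n = x.size
    i = np.arange(1, n + 1)

    # Utilities of the gain part and the (loss-averse) loss part.
    gains = np.maximum(x, 0.0) ** alpha
    losses = lam * np.maximum(-x, 0.0) ** alpha

    # Decision weights from the distorted empirical tail probabilities:
    # gains use P(X >= x_i), losses use P(X <= x_i).
    w_gain = tk_weight((n - i + 1) / n, gamma_gain) - tk_weight((n - i) / n, gamma_gain)
    w_loss = tk_weight(i / n, gamma_loss) - tk_weight((i - 1) / n, gamma_loss)

    return float(np.sum(gains * w_gain) - np.sum(losses * w_loss))


if __name__ == "__main__":
    print("w(0.01) =", float(tk_weight(0.01, 0.61)))   # small probability
    print("w(0.99) =", float(tk_weight(0.99, 0.61)))   # large probability
    print("CPT value of a fair +/-1 gamble:", cpt_value([-1.0, 1.0]))
```

With `alpha = lam = gamma_gain = gamma_loss = 1` the estimate reduces to the sample mean, which corresponds to the risk-neutral case; moving the parameters away from 1 is what makes the resulting value subjective.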