We consider the problem of episodic reinforcement learning with multiple stakeholders, each with a different reward function. Our goal is to output a policy that is socially fair with respect to these reward functions. Prior work has proposed several objectives for a fair policy to optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives each fail to satisfy at least one of them. We then consider the learning version of the problem, where the underlying model, i.e., the Markov decision process, is unknown. We study the problem of minimizing regret with respect to the fair policies that maximize three different fair objectives: minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to each of the three policies. For the objective of Nash social welfare, we also derive a lower bound on regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O(H)$ under a weaker notion of regret.
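To make the three objectives concrete, the following minimal sketch evaluates each of them on a vector of per-agent values (for instance, the expected returns of a fixed policy under each agent's reward function). It is an illustration under stated assumptions, not the algorithm of this work: the function names are hypothetical, the Nash social welfare is taken in its product form (a geometric-mean normalization is also common), and the generalized Gini welfare is assumed to use non-negative, non-increasing weights applied to the values sorted from worst-off to best-off.

```python
import math
from typing import Sequence


def minimum_welfare(values: Sequence[float]) -> float:
    """Welfare of the worst-off agent."""
    return min(values)


def generalized_gini_welfare(values: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Weighted sum of the values sorted in increasing order.

    Assumption: weights are non-negative and non-increasing, so the
    worst-off agents receive the largest weights.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    ordered = sorted(values)  # worst-off agent first
    return sum(w * v for w, v in zip(weights, ordered))


def nash_social_welfare(values: Sequence[float]) -> float:
    """Product of the agents' values (assumed non-negative)."""
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    return math.prod(values)


if __name__ == "__main__":
    # Hypothetical per-agent values of one policy, for illustration only.
    agent_values = [0.8, 0.2, 0.5]
    print(minimum_welfare(agent_values))
    print(generalized_gini_welfare(agent_values, [0.5, 0.3, 0.2]))
    print(nash_social_welfare(agent_values))
```

All three functions are invariant to the order in which agents are listed. With weights $(1, 0, \dots, 0)$, the generalized Gini welfare in this sketch reduces to the minimum welfare.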