Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must be both optimal and equitable across multiple, potentially conflicting objectives. Single-policy MORL methods can learn a fair policy for a fixed user preference using a welfare function such as the generalized Gini welfare function (GGF), but they cannot provide the diverse set of policies needed when user preferences are dynamic or unknown. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), an approximation of the Pareto front under linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms: GGF integrated with multi-policy multi-objective Q-learning (MOQL), state-augmented multi-policy MOQL for learning non-stationary policies, and a novel extension of the latter for learning stochastic policies. We evaluate our algorithms across several domains and compare them against state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.
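To make the welfare function in contribution (1) concrete, the following is a minimal sketch of the GGF in its standard form: a weighted sum of the objective values sorted in ascending order, with non-increasing weights so that the worst-off objective counts most. The function names `ggf` and `ggf_as_min` and the default weights (proportional to 1, 1/2, 1/4, ...) are assumptions made for illustration and are not taken from the paper. The second function evaluates the same quantity as a minimum over permutations of the weights, which is one way to see why the GGF is concave and piecewise linear.

```python
import itertools

import numpy as np


def _default_weights(n):
    # Assumed illustrative choice: weights proportional to 1, 1/2, 1/4, ...
    w = 0.5 ** np.arange(n)
    return w / w.sum()


def ggf(values, weights=None):
    """Generalized Gini welfare of a vector of per-objective returns."""
    v = np.sort(np.asarray(values, dtype=float))  # ascending: worst-off first
    w = _default_weights(v.size) if weights is None else np.asarray(weights, dtype=float)
    if w.shape != v.shape:
        raise ValueError("weights and values must have the same length")
    if np.any(np.diff(w) > 0):
        raise ValueError("weights must be non-increasing")
    return float(np.dot(w, v))


def ggf_as_min(values, weights=None):
    """Same value, computed as a minimum of linear functions.

    Each permutation of the weights gives one linear scalarization; the GGF
    is their pointwise minimum, hence concave and piecewise linear.
    Enumerates all permutations, so only suitable for a few objectives.
    """
    v = np.asarray(values, dtype=float)
    w = _default_weights(v.size) if weights is None else np.asarray(weights, dtype=float)
    return min(float(np.dot(p, v)) for p in itertools.permutations(w))


if __name__ == "__main__":
    print(ggf([5.0, 5.0]))   # balanced returns
    print(ggf([10.0, 0.0]))  # same total, unequal split
```

With the assumed default weights, the balanced vector `[5, 5]` scores 5.0 while `[10, 0]` scores about 3.33, even though both have the same total return; the GGF also gives the same value for `[10, 0]` and `[0, 10]`, since it depends only on the sorted values.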