Fairness plays a crucial role in many multi-agent systems, such as communication networks and financial markets. Many multi-agent dynamical interactions can be cast as Markov Decision Processes (MDPs). While existing research has focused on fairness in known environments, fairness in such systems under unknown environments remains an open problem. In this paper, we propose a Reinforcement Learning (RL) approach to achieve fairness in multi-agent finite-horizon episodic MDPs. Instead of maximizing the sum of the individual agents' value functions, we introduce a fairness function that ensures equitable rewards across agents. Since the classical Bellman equation does not hold when the objective is no longer the sum of individual value functions, traditional approaches cannot be used. Instead, to explore, we maintain a confidence region for the unknown environment and propose an online convex optimization based approach to obtain a policy constrained to this confidence region. We show that this approach achieves regret that is sub-linear in the number of episodes. Additionally, we provide a probably approximately correct (PAC) guarantee based on the obtained regret bound. We also propose an offline RL algorithm and bound its optimality gap with respect to the optimal fair solution. To reduce computational complexity, we introduce a policy-gradient type method for the fair objective. Simulation experiments also demonstrate the efficacy of our approach.
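To illustrate why a fairness objective can select a different policy than the sum of value functions, the following is a minimal sketch. It assumes two common fairness functions (proportional fairness as a sum of logarithms, and max-min fairness); the abstract does not specify which fairness function is used. The two candidate policies and their per-agent values are made-up numbers chosen only for illustration.

```python
import math

# Hypothetical per-agent values (agent 1, agent 2) under two candidate policies.
# These numbers are illustrative assumptions, not results from the paper.
candidates = {
    "favor_agent_1": (10.0, 0.1),
    "balanced": (4.0, 4.0),
}

def utilitarian(values):
    """Classical objective: sum of the individual agents' values."""
    return sum(values)

def proportional_fairness(values):
    """Assumed fairness function: sum of log-values (requires positive values)."""
    return sum(math.log(v) for v in values)

def max_min(values):
    """Assumed fairness function: value of the worst-off agent."""
    return min(values)

def best_policy(objective):
    """Return the name of the candidate policy that maximizes the objective."""
    return max(candidates, key=lambda name: objective(candidates[name]))

if __name__ == "__main__":
    for objective in (utilitarian, proportional_fairness, max_min):
        print(objective.__name__, "->", best_policy(objective))
```

In this toy setting the sum objective prefers the policy that gives almost everything to one agent, while both assumed fairness functions prefer the balanced policy. It compares only two fixed value vectors and does not model the MDP, exploration, or the learning algorithm itself.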