In traditional reinforcement learning (RL), the learner aims to solve a single-objective optimization problem: find the policy that maximizes expected reward. However, in many real-world settings, it is important to optimize over multiple objectives simultaneously. For example, when we are interested in fairness, states might have feature annotations corresponding to multiple (intersecting) demographic groups to whom reward accrues, and our goal might be to maximize the reward of the group receiving the minimum reward. In this work, we consider a multi-objective optimization problem in which each objective is defined by a state-based reweighting of a single scalar reward function. This generalizes the problem of maximizing the reward of the minimum-reward group. We provide oracle-efficient algorithms that solve these multi-objective RL problems even when the number of objectives is exponentially large: for tabular MDPs, as well as for large MDPs when the group functions have additional structure. Finally, we experimentally validate our theoretical results and demonstrate applications on a preferential attachment graph MDP.
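To make the fairness example concrete, here is a minimal sketch of the minimax objective described above; the symbols $\mathcal{G}$, $w_g$, and $H$ are illustrative notation assumed here, not taken from the paper. Each group $g$ in a (possibly exponentially large) collection $\mathcal{G}$ has a state-based weight function $w_g : \mathcal{S} \to [0,1]$ that reweights the single scalar reward $r$, and the learner seeks a policy whose worst-off group value is as large as possible:

```latex
% Hedged sketch of the minimax formulation; w_g, \mathcal{G}, and H
% are assumed notation for this illustration, not the paper's own.
\max_{\pi} \; \min_{g \in \mathcal{G}} \;
  \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{H} w_g(s_t)\, r(s_t, a_t) \right]
```

The special case where $w_g(s)$ is the indicator that state $s$ accrues reward to group $g$ recovers the problem of maximizing the reward of the minimum-reward group.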