Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample efficiency. These schemes implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove that our algorithm is guaranteed to converge to an optimal solution in a finite number of steps, or to an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, with both discrete and continuous state and action spaces.
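To make the role of GPI in strategy (i) more concrete, the following is a minimal sketch, not the algorithm described above. It assumes preferences are linear weight vectors `w` over the objectives, that each stored policy comes with a vector-valued action-value table for a single state, and that a candidate preference is prioritized by the gap between the GPI value and the best value achieved by any single stored policy. The names `Q_SET`, `TRAINED_FOR`, `gpi_action`, and `priority` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data for ONE state: Q_SET[i, a] is the vector-valued
# return (one entry per objective) of taking action a and then following
# stored policy i. Shape: (n_policies, n_actions, n_objectives).
Q_SET = np.array([
    [[10.0, 0.0], [6.0, 5.0], [0.0, 1.0]],   # policy 0
    [[1.0, 0.0], [5.0, 6.0], [0.0, 10.0]],   # policy 1
])

# Assumption: policy i was trained for the linear preference TRAINED_FOR[i].
TRAINED_FOR = np.array([[1.0, 0.0], [0.0, 1.0]])


def gpi_action(q_set, w):
    """GPI action for preference w: argmax over actions of the best
    scalarized value among all stored policies."""
    scalarized = q_set @ w                    # (n_policies, n_actions)
    return int(np.argmax(scalarized.max(axis=0)))


def gpi_value(q_set, w):
    """Scalarized value of the GPI action under preference w."""
    return float((q_set @ w).max())


def best_single_policy_value(q_set, trained_for, w):
    """Best value under w obtained by committing to one stored policy,
    where each policy takes the action that is greedy for its own preference."""
    values = []
    for q_i, w_i in zip(q_set, trained_for):
        own_action = int(np.argmax(q_i @ w_i))
        values.append(float(q_i[own_action] @ w))
    return max(values)


def priority(q_set, trained_for, w):
    """Assumed priority of preference w: how much GPI improves on the
    best single stored policy. A larger gap suggests more to gain."""
    return gpi_value(q_set, w) - best_single_policy_value(q_set, trained_for, w)


# Rank a few candidate preferences and pick the one with the largest gap.
CANDIDATES = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
PRIORITIES = [priority(Q_SET, TRAINED_FOR, w) for w in CANDIDATES]
NEXT_W = CANDIDATES[int(np.argmax(PRIORITIES))]
```

In this toy setting the two stored policies already cover the extreme preferences, so the gap is zero there, while the balanced preference `[0.5, 0.5]` has a positive gap and would be selected for training next. A full method would estimate these quantities from learned value functions over many states rather than from a fixed table.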