This paper investigates multi-objective reinforcement learning (MORL), which focuses on learning Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, a satisfactory understanding of the various MORL optimization targets and of efficient learning algorithms is still lacking. Our work offers a systematic analysis of several optimization targets, assessing their ability to find all Pareto optimal policies and the controllability of the learned policies through the preferences assigned to different objectives. We then identify Tchebycheff scalarization as a favorable scalarization method for MORL. To handle the non-smoothness of Tchebycheff scalarization, we reformulate its minimization problem as a new min-max-max optimization problem. Based on this reformulation, we propose efficient algorithms for learning Pareto optimal policies within the class of stochastic policies. We first propose an online UCB-based algorithm that achieves an $\varepsilon$ learning error with $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for a single given preference. To further reduce the cost of exploring the environment under different preferences, we propose a preference-free framework that first explores the environment without pre-defined preferences and then produces solutions for any number of preferences. We prove that this framework requires only $\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity during the exploration phase and no additional exploration afterward. Lastly, we analyze smooth Tchebycheff scalarization, an extension of Tchebycheff scalarization, which we prove is more effective at distinguishing Pareto optimal policies from other weakly Pareto optimal policies, depending on the entry values of the preference vectors. We further extend our algorithms and theoretical analysis to accommodate this optimization target.
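For concreteness, a minimal sketch of the two scalarizations referenced above, written in standard notation that is assumed rather than fixed by this abstract: $m$ objectives, a preference vector $\lambda$ in the simplex, per-objective policy values $J_i(\pi)$, and an ideal point $z^{*}$ with $z_i^{*} \ge \max_\pi J_i(\pi)$; the paper's exact sign and normalization conventions may differ.
\[
\min_{\pi}\ \max_{i \in [m]}\ \lambda_i \bigl( z_i^{*} - J_i(\pi) \bigr)
\qquad \text{(Tchebycheff scalarization)}
\]
\[
\min_{\pi}\ \mu \log \sum_{i=1}^{m} \exp\!\Bigl( \tfrac{\lambda_i ( z_i^{*} - J_i(\pi) )}{\mu} \Bigr),
\quad \mu > 0
\qquad \text{(smooth Tchebycheff scalarization)}
\]
The smooth variant replaces the non-smooth max with a log-sum-exp, which upper-bounds the max by at most $\mu \log m$ and recovers it as $\mu \to 0^{+}$.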