Sequential decision making in the real world often requires balancing conflicting objectives. In general, there exists a plethora of Pareto-optimal policies that embody different trade-offs between objectives, and obtaining them exhaustively with deep neural networks is technically challenging. In this work, we propose a novel multi-objective reinforcement learning (MORL) algorithm that trains a single neural network via policy gradient to approximate the entire Pareto set in a single training run, without relying on linear scalarization of the objectives. The proposed method works in both continuous and discrete action spaces without any change to the design of the policy network. Numerical experiments in benchmark environments demonstrate the practicality and efficacy of our approach in comparison to standard MORL baselines.
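
To make the terms Pareto-optimal and linear scalarization concrete, the following minimal Python sketch may help. It is not the proposed algorithm: the four return vectors are hypothetical numbers chosen for illustration, and the helper names (`dominates`, `pareto_indices`, `scalarization_winners`) are assumptions introduced here. It shows a well-known limitation of linear scalarization: a Pareto-optimal point lying in a concave part of the front is never the maximizer of any weighted sum of the objectives.

```python
from typing import List, Sequence, Set


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_indices(points: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates (objectives are maximized)."""
    return [i for i, p in enumerate(points) if not any(dominates(q, p) for q in points)]


def scalarization_winners(points: Sequence[Sequence[float]], num_weights: int = 101) -> Set[int]:
    """Indices that maximize w * r1 + (1 - w) * r2 for some w on a grid over [0, 1]."""
    winners: Set[int] = set()
    for k in range(num_weights):
        w = k / (num_weights - 1)
        scores = [w * p[0] + (1 - w) * p[1] for p in points]
        winners.add(scores.index(max(scores)))
    return winners


# Hypothetical two-objective returns of four policies (illustrative values only).
returns = [
    (1.0, 0.0),  # index 0: best in objective 1
    (0.0, 1.0),  # index 1: best in objective 2
    (0.4, 0.4),  # index 2: a balanced compromise in a concave part of the front
    (0.3, 0.3),  # index 3: dominated by index 2
]

pareto = pareto_indices(returns)
found = scalarization_winners(returns)
print("Pareto-optimal indices:", pareto)
print("Found by weighted sums:", sorted(found))
```

In this toy example the balanced policy at index 2 is Pareto-optimal, yet every weight vector prefers one of the two extreme policies, because max(w, 1 - w) is at least 0.5 while the compromise scores only 0.4. This is one reason a method that avoids linear scalarization can be attractive.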