Many sequential decision-making problems require optimizing several objectives that may conflict with each other. The conventional way to deal with a multi-task problem is to build a scalar objective function as a linear combination of the individual objectives. However, when the conflicting objectives have different scales, this method requires trial and error to find suitable weights for the combination. As a result, in most cases this approach cannot guarantee a Pareto-optimal solution. In this paper, we develop a single-agent, scale-independent multi-objective reinforcement learning algorithm based on the Advantage Actor-Critic (A2C) algorithm. We then carry out a convergence analysis of the devised multi-objective algorithm, which provides a convergence-in-mean guarantee. Finally, we perform experiments on a multi-task problem to evaluate the performance of the proposed algorithm. Simulation results show that the developed multi-objective A2C approach outperforms the single-objective algorithm.
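To make the scale issue with linear scalarization concrete, the following minimal sketch compares three candidate policies under a weighted sum of two objectives. It is an illustration only and not the algorithm proposed in the paper: the return values, the weights, and the helper names `scalarize` and `argmax` are all hypothetical and chosen for the example.

```python
# Hypothetical per-objective returns for three candidate policies.
# Objective 0 lives on a scale of about 1, objective 1 on a scale of about 1000.
returns = [
    [0.9, 100.0],  # policy 0: strong on objective 0, weak on objective 1
    [0.6, 600.0],  # policy 1: balanced across both objectives
    [0.1, 900.0],  # policy 2: weak on objective 0, strong on objective 1
]


def scalarize(returns, weights):
    """Linear scalarization: one weighted sum of objectives per policy."""
    return [sum(w * r for w, r in zip(weights, row)) for row in returns]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Equal weights: the large-scale objective dominates the weighted sum.
best_equal = argmax(scalarize(returns, [0.5, 0.5]))

# Hand-tuned weights that rescale objective 1 by 1/1000 (an assumed choice).
best_tuned = argmax(scalarize(returns, [0.5, 0.5 / 1000.0]))

print(best_equal, best_tuned)
```

With equal weights the selection is driven almost entirely by the larger-scale objective, while the rescaled weights select the balanced policy. The selected policy therefore depends on how the weights compensate for the objectives' scales, which is the trial-and-error burden described above.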